Batch Job Analysis to Improve the Success Rate in HPC

JunWeon Yoon (1), TaeYoung Hong (2), ChanYeol Park (3), HeonChang Yu (4)
(1) First author: KISTI and Korea University, jwyoon@kisti.re.kr
(2, 3) KISTI, tyhong@kisti.re.kr, chan@kisti.re.kr
(4) Corresponding author: Korea University, yuhc@korea.ac.kr

Abstract

Tachyon is a supercomputer built by Sun (Oracle) at KISTI for use in a variety of science applications. It is composed of 3,200 computing nodes plus infrastructure facilities, and it runs a range of supporting software such as file systems, compilers, debuggers, and parallel tools. Tachyon uses Sun Grid Engine (SGE) as the scheduler that carries out the batch jobs of the cluster, and the machine performs at a theoretical peak of 300 teraflops. In this paper, we analyze the batch job logs that record the history of operations performed by SGE. In particular, we focus on distinguishing failed job logs in order to find the causes of failure, and we further separate those failures into problems caused by user actions and problems caused by system errors. As a first step, the validity of a job can be checked from the log itself, so some failure cases can be blocked in advance. Users can then recognize a problem immediately instead of waiting unnecessarily for their turn in the queue, and from the scheduler's point of view the overall waiting time for users is reduced.

Keywords: HPC, Supercomputer, Batch job, Scheduler, SGE

1. Introduction

Tachyon is a high-performance parallel computing system constructed from Sun Blade servers. It is composed of 3,200 computing nodes and infrastructure components such as login nodes, a scheduler server, storage, and an archiving system. The system runs Red Hat 5.3 and uses Lustre [1] as its file system, with the nodes connected by InfiniBand [2]. Figure 1 shows a summary of the Tachyon system [3]. Tachyon uses Sun Grid Engine (hereinafter "SGE") as its distributed resource management scheduler [4]. When a user submits a job to the cluster, the qmaster in SGE sorts the jobs in order of priority [5]; job priority is derived from the scheduler policies [6]. The sorted list of pending jobs is then assigned to job slots (CPUs) in order of priority [7]. In this paper, we classify job logs by the SGE exit code, which records why a job stopped abnormally. Using the exit code value we separate the reasons jobs ended, pick out the causes of failure, and prevent some jobs from failing in advance.

Figure 1. Summary of the Tachyon system
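To make the batch workflow concrete, the following is a minimal sketch of how a job reaches SGE on a system like Tachyon. It is not taken from the paper; the queue name, parallel environment name, and resource values are assumptions for illustration only.

    #!/bin/bash
    # job.sh -- minimal SGE submit script (sketch; queue/PE names are assumed)
    #$ -N sample_job          # job name
    #$ -q normal              # target queue (assumed name)
    #$ -pe mpi 32             # parallel environment and slot count (assumed PE name)
    #$ -l h_rt=24:00:00       # requested wall-time limit
    #$ -cwd                   # run in the submission directory
    mpirun -np $NSLOTS ./my_app   # NSLOTS is set by SGE to the granted slot count

Submitted with "qsub job.sh", the job waits in the pending list until the qmaster assigns it free slots according to the priority policies described above.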

2. Batch Job Processing

2.1. Job execution log on SGE

Put simply, a scheduler manages and controls batch jobs so that limited resources can be shared [8]. Many such products have been developed, including SGE, Torque, PBS, and LoadLeveler. As noted above, the Tachyon system uses SGE as its batch job scheduler. Figure 2 shows the overall composition of queues, slots, hosts, and host groups in SGE. Users can submit many jobs without having to worry about where they execute [9]. Hosts can be grouped into host groups as needed. A job slot is the minimum resource unit and is regarded as one CPU core; queues own job slots, and the sorted list of pending jobs is assigned to job slots in order of priority [10]. SGE also exposes job execution information through various commands. For example, running "qacct -j <job_id>" prints the execution log after the job has ended, as in Figure 3. This log contains basic execution data such as the queue type, job ID, submit time, start time, end time, and final job status.

Figure 2. Host group, queue, and job slots
Figure 3. Executed job information from SGE

2.2. Analysis of job logs

In this paper, we analyzed the job logs for the year 2012 as part of the full data set. Based on the fields shown in Figure 3, the information for each executed job is stored again, as in Table 1, after the job completes on Tachyon. This data describes not only how a job executed but also why it terminated; we simply call it the converted log. In particular, the last two fields of the converted log (nos. 19 and 20 of Table 1) show whether the job finished normally. The causes of abnormally ended jobs are quite diverse: forced termination by the user, job submit-script errors, exceeding the wall-time limit of the job scheduler [11], software and hardware troubles, and so on. Each reason has a code value, as shown in Table 2.

Table 1. Information in the executed job log

No.  Property       Value (example)   |  No.  Property     Value (example)
1    DATE                             |  11   RUN(s)
2    JOBID                            |  12   CPUS         80
3    GID            na0***            |  13   CPU USAGE
4    UID            r000***           |  14   MEM USAGE    68
5    JOBNAME        jj_007_***        |  15   MAXVMEM
6    QNAME          ocean4special     |  16   STATUS       D
7    SUBMIT(DATE)                     |  17   E-CPU        80
8    START(DATE)                      |  18   E-RUN(s)
9    END(DATE)                        |  19   EXIT-CODE    0(11)
10   WAIT(s)                         |  20   FAILED       0(11)
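As a small illustration of pulling the termination fields out of SGE's accounting data, the snippet below reads the failed and exit_status fields for a single finished job; the job ID is a placeholder.

    #!/bin/bash
    # Read the two failure-related fields from the accounting record of one job.
    job_id=1234567   # placeholder job ID
    qacct -j "$job_id" | awk -v id="$job_id" '
        $1 == "failed"      { failed = $2 }   # SGE failed code (see Table 2)
        $1 == "exit_status" { status = $2 }   # exit status of the job script itself
        END { printf "job %s: failed=%s exit_status=%s\n", id, failed, status }'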

Table 2. qacct -j failed field codes (from SGE)

Code   Description                        Meaning for job
0      No failure                         Job ran, exited normally
1~11   Presumably before job / before writing config / before writing PID / on reading config file / setting processor set / before prolog / in prolog / before prestart / in prestart / before job     Job could not be started
12     Before pestop                      Job ran, failed before calling PE stop procedure
13     In pestop                          Job ran, PE stop procedure failed
14     Before epilog                      Job ran, failed before calling epilog script
15     In epilog                          Job ran, failed in epilog script
16     Releasing processor set            Job ran, processor set could not be released
24     Migrating (checkpointing jobs)     Job ran, job will be migrated
25     Rescheduling                       Job ran, job will be rescheduled
26     Opening output file                Job could not be started, stderr/stdout file could not be opened
27     Searching requested shell          Job could not be started, shell not found
28     Changing to working directory      Job could not be started, error changing to start directory
100    Assumedly after job                Job ran, job killed by a signal

If the failed code is 0, the job ended normally: for jobs that run successfully, the "qacct -j" output shows a value of 0 in the failed field and reports the job's own exit status in the exit_status field. Otherwise, the failed field displays one of the code values listed in Table 2. Using the converted log, we extracted these codes and counted the number of occurrences of each failure type; Figure 4 is the script used to tally the codes. We then categorized the causes according to failure type and classified the major factors, shown in Table 3, out of all the failed codes observed on the Tachyon system. Some failure reasons are hard to track down.

    #!/bin/bash
    # Extract job ID, exit code and failed code (fields 2, 19, 20) from the converted logs.
    cat 2012*.sge | awk '{ print $2, $19, $20 }' > sge.exit.code.${today}

    # Extract the job IDs of failed jobs (non-zero failed code) into a temporary file.
    cat sge.exit.code.${today} | awk '$3 != 0 { print $1 }' > ${tmpdir}/sge.exit.code.errors.jobid.${today}

    errorjobnum=$(wc -l ${tmpdir}/sge.exit.code.errors.jobid.${today} | awk '{ print $1 }')  # number of failed jobs
    totaljobnum=$(wc -l sge.exit.code.${today} | awk '{ print $1 }')                         # total jobs
    successjobnum=$(expr $totaljobnum - $errorjobnum)                                        # number of success jobs
    errrate=$(awk -v x=$errorjobnum -v y=$totaljobnum 'BEGIN { printf "%.3f\n", x/y*100 }')  # failed rate (%)
    succrate=$(awk -v z=$errrate 'BEGIN { printf "%.3f\n", 100-z }')                         # success rate (%)

    echo "Number of Total Jobs = $totaljobnum, Number of Success Jobs = $successjobnum, Number of Failed Jobs = $errorjobnum" > error_rate_${datamonth}.log
    echo "Failed Rate = ($errorjobnum/$totaljobnum) * 100 = $errrate %, Success Rate = $succrate %" >> ${tmpdir}/error_rate_${datamonth}.log

Output for 2012:

    Number of Total Jobs = 564746, Number of Success Jobs = 460564, Number of Failed Jobs = 104182
    Failed Rate = (104182/564746) * 100 = 18.448 %, Success Rate = 81.552 %

Figure 4. A script to separate reasons using failed codes (2012)
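To see which failure types dominate, the per-code counts summarized in Table 3 can be tallied directly from the three-column extract (job ID, exit code, failed code) produced by the script in Figure 4; the following is a sketch of that counting step.

    #!/bin/bash
    # Count how often each non-zero failed code occurs, most frequent first.
    awk '$3 != 0 { count[$3]++ } END { for (c in count) print count[c], "failed=" c }' \
        sge.exit.code.${today} | sort -rn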

Table 3. Failed codes from the Tachyon system

Value                   Reason                                                         Times    Rate
0(*),100(*) or 100(*)   Job ran, job killed by a signal                                72,...   ...%
0(*),13(*) or 13(*)     Job ran, PE stop procedure failed                              18,...   ...%
27(*)                   Job could not be started, shell not found                      ...      ...%
0(*),28(*) or 28(*)     Job could not be started, error changing to start directory    ...      ...%

3. Strategies

With the results of this experiment, we can take steps for each failed code. As noted earlier, these codes record the reasons jobs failed; typical cases include job submit-script errors, wrong commands, floating-point exceptions, references to invalid memory addresses, and so on. After analyzing the failed codes in the converted log, some types of failure can be prevented before a user's job ever runs. To head off such user errors, we can modify the pre- and post-processing performed by the prolog and epilog in SGE; prolog and epilog are global scripts that can be invoked before and after any job. For example, as mentioned above, a job submit-script error is one of the basic causes of failure, covering cases such as exceeding the time limit or the maximum number of submitted jobs, wrong commands, and parallel-environment errors. In these cases we can filter the error beforehand in the prolog script, without waiting for the job to run. Figure 5 shows a prolog script that filters out invalid conditions in a job submit script. In addition, some jobs complete successfully but still leave a failed log; we modified the epilog script so that such results are reflected accurately.

    normal )  # Queue name
        # If the job does not fit the queue properties, warn the user and delete the job.
        if [ $total_slots -lt 17 ]; then        # minimum number of slots per job
            echo "[ERROR] The CPU number of your job should be greater than or equal to 17"
            ${SSH} $SGE_O_HOST "echo '[ERROR] The CPU number of your job should be greater than or equal to 17 in normal queue.' | write $USER $SSH_TTY"
            ${QDEL} $JOB_ID
            exit 1
        elif [ $total_slots -gt 1568 ]; then    # maximum number of slots per job
            echo "[ERROR] The CPU number of your job should be less than or equal to 1536"
            ${QDEL} $JOB_ID
            exit 1
        fi
        if ( isdigit $hrt ) && [ $hrt -gt 48 ]; then   # wall-time limit (h_rt)
            echo "[ERROR] The wall time limit (h_rt) for normal queue is 48:00:00"
            ${SSH} $SGE_O_HOST "echo '[ERROR] The wall time limit (h_rt) for normal queue is 48:00:00' | write $USER $SSH_TTY"
            ${QDEL} $JOB_ID
            exit 1
        fi
        ;;

Figure 5. Prolog script for prevention of user job-script errors
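The paper modifies the epilog as well but does not reproduce it. The sketch below illustrates one plausible shape for that post-processing, under the assumption that a job whose application exited with status 0 should not later be counted as a failure; the status-file convention and log path are hypothetical, not the actual Tachyon configuration.

    #!/bin/bash
    # Epilog sketch -- a hypothetical reconstruction, not the paper's actual script.
    # Assumption: the job wrapper writes the application's exit status to
    # $TMPDIR/app_exit_status before the job ends; $JOB_ID and $TMPDIR are set by SGE.
    status_file="$TMPDIR/app_exit_status"
    app_status=$(cat "$status_file" 2>/dev/null || echo unknown)
    # Append the real application status to a site-local log (assumed path), so the
    # later converted-log analysis can tell a genuinely failed job from one that
    # merely tripped a post-job step.
    echo "$(date '+%Y%m%d %H%M%S') job=$JOB_ID app_exit=$app_status" >> /var/log/sge/job_status.log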

4. Simulation

Following the steps above, we carried out a simulation. In practice, several of the failed codes do not clearly identify the reason for failure. As described, we fixed the prolog/epilog scripts and corrected how wrong job execution information is recorded. Figure 6 shows the job execution rate in 2012. First, we applied the changes to failed codes 13, 27, and 28. Figure 7 shows the simulation result after this pre- and post-processing: some errors can now be prevented in advance. From the result shown in Figure 7, the average job success rate improves from 77.0% to 80.2%.

Figure 6. Job execution rate
Figure 7. Fixed job execution rate
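The recomputation itself can be sketched as follows: using the three-column extract from Figure 4, treat the preventable codes (13, 27, and 28) as successes and recompute the rate. This is an illustrative reconstruction, not the paper's actual simulation code.

    #!/bin/bash
    # Recompute the success rate as if failed codes 13, 27 and 28 had been
    # filtered out in advance by the prolog/epilog changes.
    awk '{
            total++
            if ($3 == 0 || $3 == 13 || $3 == 27 || $3 == 28) ok++   # preventable codes count as success
         }
         END { printf "simulated success rate = %.1f %%\n", ok / total * 100 }' sge.exit.code.${today}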

5. Conclusion and future work

In this paper, we analyzed actual job execution logs on the Tachyon system using the converted log. For this work, we had to ferret out the reasons for failure by unpacking the failed codes from SGE; these codes express numerically why a job ended abnormally. Some failure cases can be blocked in advance using pre- and post-processing, so we fixed the prolog and epilog scripts in SGE. As a result, users can find problems in a job script before the job executes, and the overall job waiting time is reduced. As noted, for some failed codes it is difficult to determine the cause of failure. As a next step, we need more research on the relationship between failed codes and signals from Linux, which could further improve the job success rate.

6. References

[1] F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang, "Understanding Lustre Filesystem Internals," Oak Ridge National Laboratory, Technical Report ORNL/TM-2009/117, 2009.
[2] G. Pfister, "An Introduction to the InfiniBand Architecture," IEEE Press.
[3] National Institute of Supercomputing and Networking, KISTI.
[4] D. Templeton, "A Beginner's Guide to Sun Grid Engine 6.2," Sun Microsystems whitepaper, July.
[5] C. Chaubal, "Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System," Technical Report, Sun BluePrints Online, Sun Microsystems, Inc., Santa Clara, CA, USA.
[6] J. H. Abawajy, "An Efficient Adaptive Scheduling Policy for High-Performance Computing," Future Generation Computer Systems, vol. 25, no. 3, Mar.
[7] G. Cawood, T. Seed, R. Abrol, and T. Sloan, "TOG & JOSH: Grid Scheduling with Grid Engine & Globus," Proceedings of the UK e-Science All Hands Meeting, Nottingham.
[8] M. Stillwell, F. Vivien, and H. Casanova, "Dynamic Fractional Resource Scheduling versus Batch Scheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 3, March.
[9] S. Iqbal, R. Gupta, and Y.-C. Fang, "Job Scheduling in HPC Clusters," Dell Power Solutions.
[10] J. Stosser, P. Bodenbenner, S. See, and D. Neumann, "A Discriminatory Pay-as-Bid Mechanism for Efficient Scheduling in the Sun N1 Grid Engine," Proceedings of the 41st Annual Hawaii International Conference on System Sciences, p. 382, 2008.
[11] R. Kumar and S. Vadhiyar, "Identifying Quick Starters: Towards an Integrated Framework for Efficient Predictions of Queue Waiting Times of Batch Parallel Jobs," Job Scheduling Strategies for Parallel Processing, Springer Berlin Heidelberg.
