Experience with Preemption for Urgent Computing

Jason Hedden, Joseph Insley, Ti Leggett, Michael E. Papka
UChicago/Argonne TeraGrid Resource Provider

TeraGrid

TeraGrid is an open scientific discovery infrastructure combining leadership-class resources at nine partner sites to create an integrated, persistent computational resource. It is a collection of:
- high-performance networks
- high-performance computers
- data resources and tools
- high-end experimental facilities

Together these provide more than 102 teraflops of computing capability, more than 15 petabytes of online and archival data storage, and over 100 discipline-specific databases.

Partner sites: Indiana University, Oak Ridge National Laboratory, National Center for Supercomputing Applications, Pittsburgh Supercomputing Center, Purdue University, San Diego Supercomputer Center, Texas Advanced Computing Center, University of Chicago/Argonne National Laboratory, and the National Center for Atmospheric Research.

UChicago/Argonne TeraGrid Resource

- 96 visualization nodes: dual Intel Xeon IA-32 processors, 4 GB memory each, GeForce 6600GT AGP graphics cards
- 64 compute nodes: dual Intel Itanium2 IA-64 processors, 4 GB memory each
- 2 visualization login nodes: dual Intel Xeon IA-32 processors, 4 GB memory each, GeForce 6600GT AGP graphics cards
- 2 compute login nodes: dual Intel Itanium2 IA-64 processors, 4 GB memory each
- 4 TB of disk for home directories
- 16 TB of disk for parallel I/O (temporary storage)
- High-performance Myrinet interconnect
- High-performance gigabit Ethernet

Technology and Social Barriers

Technology
- Technology needed to deliver on-demand computing in a rapid, reliable, and routine manner.

Social
- Investigate how smaller sites can contribute to a mission of improving science and engineering.
- Understand the techniques/incentives needed to promote continued use.

Policy Change

- Notification of additional use of the resource
- Control based on SPRUCE tokens
- Prioritization of jobs:
  - next-to-run jobs (no preemption)
  - run immediately (preemption)
- Explanation of incentives for use
- Alternative charging model(s) (a toy illustration follows below):
  - discount for jobs that are not preempted
  - no charge for preempted jobs
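As a toy illustration of that charging idea only (the accounting-record format, the 10% discount rate, and the field layout are invented for this sketch, not the site's actual accounting), a post-processing pass over per-job records might look like:

    # Hypothetical per-job accounting lines: <jobid> <nodes> <walltime_hours> <preempted:yes|no>
    # Preempted jobs are charged nothing; all other jobs get a discount for
    # having run exposed to preemption.  Rates and format are assumptions.
    awk -v discount=0.10 '
        { su = $2 * $3 }                                # service units = nodes * hours
        $4 == "yes" { printf "%s 0.00\n", $1; next }    # preempted job: no charge
        { printf "%s %.2f\n", $1, su * (1 - discount) } # discounted charge otherwise
    ' accounting.log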

Torque and Moab

Torque is an open-source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original Portable Batch System (PBS) project and has incorporated significant advances in scalability, fault tolerance, and feature extensions contributed by many leading-edge HPC organizations, including TeraGrid.

Moab is a closed-source advanced job scheduler that integrates with Torque. It is a highly optimized and configurable tool capable of supporting an array of scheduling policies, dynamic priorities, extensive reservations, and fairshare capabilities.

Both tools are developed and supported by Cluster Resources, Inc. (www.clusterresources.com).

Torque Modifications

Torque spruce queue configuration:
    Qmgr: create queue spruce queue_type=execution
    Qmgr: set queue spruce started=true
    Qmgr: set queue spruce enabled=true

Torque submit filter (qsub wrapper); contains SPRUCE and local site-specific code (a sketch follows below):
- Reject jobs submitted directly to the spruce queue.
- Verify the user identity and requested resources.
- Assign the urgency priority level.

Torque accounting filter:
- Discount jobs submitted on the IA-64 resources.
- Move preempted-job information to local logs.
- Alert local operations of any jobs submitted to the urgent spruce queue.
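For concreteness, a minimal submit filter along these lines is sketched below. It is not the production SPRUCE filter: the filter path, the spruce_token_verify helper, and the exact checks are assumptions used only to show the shape of the mechanism. In Torque, the filter is registered in torque.cfg (e.g. SUBMITFILTER /usr/local/sbin/torque_submitfilter); qsub pipes the job script to it on stdin, submits whatever the filter writes to stdout, and rejects the job if the filter exits non-zero.

    #!/bin/sh
    # Sketch of a Torque submit filter (qsub wrapper) -- not the production
    # SPRUCE filter.  'spruce_token_verify' is a hypothetical site-local helper
    # that consults the SPRUCE web service (user, resource, time, urgency).

    TMP=$(mktemp /tmp/submitfilter.XXXXXX) || exit 1
    trap 'rm -f "$TMP"' EXIT
    cat > "$TMP"                          # capture the incoming job script

    # Only token-verified jobs may enter the urgent 'spruce' queue.
    if grep -q '^#PBS .*-q[ =]*spruce' "$TMP"; then
        if ! /usr/local/sbin/spruce_token_verify "$USER" "$TMP"; then
            echo "submit filter: submission to the spruce queue rejected" >&2
            exit 1
        fi
    fi

    cat "$TMP"                            # hand the (possibly modified) script back to qsub
    exit 0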

Moab Modifications

Set the preemption policy to CANCEL, REQUEUE, or CHECKPOINT; CANCEL causes preempted jobs to be deleted and removed from the queue:
    PREEMPTPOLICY CANCEL

Moab manages the urgency level of incoming jobs through Quality of Service (QOS) configurations. Preemption is used to guarantee resources to critical jobs by removing the lowest-priority, best-fit jobs. The priority value is dynamic and can be configured to adjust based on time in queue, requested resources, and user or group accounts:
    QOSCFG[red]     QFLAGS=PREEMPTOR PRIORITY=1000000
    QOSCFG[orange]  QFLAGS=PREEMPTEE PRIORITY=10000
    QOSCFG[yellow]  QFLAGS=PREEMPTEE PRIORITY=5000
    QOSCFG[default] QFLAGS=PREEMPTEE PRIORITY=1

Assign Quality of Service configurations to the spruce and default Torque queues (an example submission follows below):
    CLASSCFG[spruce] QDEF=yellow QLIST=orange,red
    CLASSCFG[dque]   QDEF=default

Enable logging of preemption events:
    RECORDEVENTLIST JOBPREEMPT
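With CLASSCFG[spruce] QDEF=yellow QLIST=orange,red, jobs entering the spruce class default to the yellow QOS but are allowed to run under orange or red. In this deployment the urgency level is assigned by the submit filter from the SPRUCE token, but for testing an elevated QOS could be requested directly, assuming Moab's qos resource-manager extension is enabled; the node count, walltime, and script name below are placeholders:

    # hypothetical urgent submission: 16 nodes for 2 hours at the 'red' (preemptor) QOS
    qsub -q spruce -l qos=red -l nodes=16,walltime=02:00:00 forecast_run.sh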

Process of Preemption

1. The user activates a SPRUCE token through the SPRUCE web services for the UChicago/Argonne resource.
2. The job is submitted (via qsub or the Grid interfaces).
3. The submit filter verifies the user, resource, requested time, urgency level, etc.
   - If verification fails, the job is rejected.
   - If verification succeeds, the job is passed to Torque and Moab with its urgency level.

Tornado Season

- Partnership with LEAD and SPRUCE
- Currently a testbed, running April 1st through June 1st
- 4 tokens activated for 72 hours each
- ~30 test runs in preemption mode
- Ready for production use, with tokens integrated into the LEAD portal/gateway

Experience

- Last week we saw 68.4% utilization of the machine.
- The preemption was noticed, but it was not an issue.
- Preempted a handful of times (Monday, maybe).

Future (Joint) Work

- How-to guide on what is needed for us to automatically restart preempted users' jobs:
  - restart of the preempted job
  - restart from a checkpoint file
- Support for next-to-run tokens
- Flexible charging structure
- Network reservation (bandwidth reservation)
- Coupling analysis/visualization by arranging reservations for the needed resources

Acknowledgement

- TeraGrid team at UChicago/Argonne
- SPRUCE team at UChicago/Argonne

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and in part by the National Science Foundation under grant OCI-0504086.