Cobalt: An Open Source Platform for HPC System Software Research


Cobalt: An Open Source Platform for HPC System Software Research
Edinburgh BG/L System Software Workshop
Narayan Desai
Mathematics and Computer Science Division, Argonne National Laboratory
October 6, 2005

Argonne National Laboratory is managed by The University of Chicago for the U.S. Department of Energy

Cobalt Research Project

- Goal: investigate advanced systems management for complex ultra-scale architectures
- Strategy: build an Open Source platform of parallel tools and components that enable rapid experimentation and exploration of advanced features
- Architecture (components based on the SciDAC Scalable System Software project):
  - A collection of interacting components (see the sketch below)
  - A communication and event system
  - Well-defined interfaces between components
- Portable: BG/L systems, Linux and Mac OS X clusters
- Initial focus on reconfigurable user environments
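To illustrate the component style, a Cobalt-like component can be sketched in Python as a small service that exposes a few well-defined methods over the standard library's XML-RPC server. This is a hypothetical sketch, not Cobalt's actual API: the component name, methods, and port are invented for illustration.

    # Hypothetical sketch of a Cobalt-style component: a small service
    # exposing a few well-defined methods over XML-RPC. Names and port
    # are illustrative only, not Cobalt's actual interface.
    from xmlrpc.server import SimpleXMLRPCServer

    class QueueManager:
        """Toy queue manager holding a list of queued job descriptions."""
        def __init__(self):
            self.jobs = []
            self.next_id = 1

        def add_job(self, spec):
            """Accept a job specification (a dict) and return a job id."""
            spec["jobid"] = self.next_id
            self.next_id += 1
            self.jobs.append(spec)
            return spec["jobid"]

        def get_jobs(self):
            """Return all queued job specifications."""
            return self.jobs

    server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
    server.register_instance(QueueManager())
    server.serve_forever()

Keeping each component this small is what makes "write another one" a realistic option: a replacement only has to speak the same interface.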

Motivation

- Needed to support both computational and computer science users
- System software developers have different needs from computational scientists:
  - System hangs are common, even desired
  - A large variety of configurations is required, sometimes simultaneously
  - The application can span all software on a node
- System failures are more common during system software research and development
  - System software must deal gracefully with faults
- Most resource managers are not suitable for system software research environments

Resource, Job, and Queue Management

- Classic packages: OpenPBS, PBS Pro, Maui, LSF, LoadLeveler
- Shortcomings:
  - Difficult (or impossible) to modify large monolithic systems
  - Interfaces between components poorly documented
- Cobalt: smaller and simpler is better
  - Admittedly feature-poor; a RISC approach to resource and job management
  - ~4K lines of component code (mostly Python)
  - ~1K lines of BG/L-specific code
- However, its agility makes it an excellent research platform:
  - Rapid reconfiguration of components permits exploration of many interlinked system management issues
  - Porting and adapting code to new platforms and system models is relatively easy
  - The small codebase makes the system easy to modify when needed
  - If you don't like a particular component implementation, write another one

Cobalt Architecture on Blue Gene

[Architecture diagram showing Cobalt's Queue Manager, Scheduler, Allocation Manager, and Process Manager alongside the IBM control system (Bridge API, MMCS, CIODB, mpirun, DB2), compute node kernels (CNK), and the I/O node kernel and ramdisk (CIOD, ZeptoOS, PVFS2).]

Cobalt on BlueGene/L

$ man cqsub
cqsub - submit jobs to the queue manager for execution
cqsub [-d] [-q queue] [-p project] [-t time] [-n number of nodes] [-m mode] [-c process count] <executable> <options>
[...]
-k kernel profile    Run the job with the specified kernel profile.

Dynamic kernel selection
- Combined with ZeptoOS, jobs can use different I/O node kernels and tuning parameters
- This extremely important user-level feature is ONLY available via Cobalt
- Provides unique flexibility to explore new territory:
  - Testing and benchmarking of experimental kernels
  - Application-dependent kernel tuning
  - Required for system software research

Small partition support
- Cobalt's component-based design allowed rapid support for 32-node partitions
- Absolutely critical for application porting workshops
- The most requested Cobalt feature for BG/L!

Submitting Jobs: cqsub

32 nodes, 30 minutes:
$ cqsub -n 32 -t 30 cpi
2702

Virtual node mode, 64 processes, default queue:
$ cqsub -n 32 -c 64 -m vn -t 30 -q default cpi
2703

Other options:
- Output prefix: -O <output path prefix>
- Kernel profile: -k <profile>
- Environment variables: -e var1=val1:var2=val2
- Working directory: -C <path>
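Because cqsub prints the new job id on success (2702 and 2703 above), submissions are easy to script. A hypothetical wrapper, not part of Cobalt, might look like:

    # Hypothetical helper (not part of Cobalt): submit a job via cqsub
    # and return the job id that cqsub prints on success.
    import subprocess

    def submit(executable, nodes, minutes, queue="default"):
        result = subprocess.run(
            ["cqsub", "-n", str(nodes), "-t", str(minutes),
             "-q", queue, executable],
            capture_output=True, text=True, check=True,
        )
        return int(result.stdout.strip())  # cqsub prints the job id

    # Example: jobid = submit("cpi", nodes=32, minutes=30)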

Job Status: cqstat

$ cqstat -f
JobID  User     WallTime  Nodes  State    Location       Mode  Procs  Queue    StartTime
=========================================================================================
16674  cpsosa   03:40:00  1024   queued   N/A            co    1024   default  N/A
16743  runesha  02:00:00  1024   queued   N/A            co    1024   default  N/A
16941  linsay   24:00:00  128    running  R000_J102-128  vn    256    default  10/03/05 11:59:05
16982  fischer  24:00:00  512    running  ANL_R001       vn    1024   default  10/04/05 03:40:24

Killing Jobs: cqdel

$ cqsub -n 32 -c 1 -t 10 -k ZeptoOS-1.1 ./test
14

$ cqstat
JobID  User     WallTime  Nodes  State    Location
===================================================
14     beckman  00:10:00  32     running  UE_R001_32B

$ cqdel 14
Deleted Jobs
JobID  User
================
14     beckman

Note: combined with ZeptoOS, jobs can use different I/O node kernels and tuning parameters.

Partition Status: partlist

$ partlist
<Partition admin="online" name="R000_J102-32" queue="short" state="idle" />
<Partition admin="online" name="ANL_R001" queue="default" state="busy" />
<Partition admin="online" name="R000_J102-64" queue="short" state="idle" />
<Partition admin="online" name="R000_J106-64" queue="short" state="idle" />
<Partition admin="online" name="R000_J111-64" queue="short" state="idle" />
<Partition admin="online" name="R000_J115-64" queue="default" state="idle" />
<Partition admin="online" name="R000_J203-64" queue="default" state="idle" />
<Partition admin="online" name="R000_J207-64" queue="default" state="idle" />
<Partition admin="online" name="R000_J210-64" queue="default" state="idle" />
<Partition admin="online" name="R000_J214-64" queue="default" state="idle" />
<Partition admin="online" name="R000_J102-128" queue="default" state="busy" />
<Partition admin="online" name="R000_J111-128" queue="default" state="idle" />
<Partition admin="online" name="R000_J203-128" queue="default" state="idle" />
<Partition admin="online" name="R000_J210-128" queue="default" state="idle" />
<Partition admin="online" name="R000_J104-32" queue="short" state="idle" />
<Partition admin="online" name="R000_J106-32" queue="admin" state="idle" />
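Since partlist emits XML-style records, its output is easy to post-process. A hypothetical sketch, assuming the output format shown above:

    # Hypothetical sketch: parse partlist's XML-style output to find
    # idle partitions in a given queue. Assumes the format shown above.
    import subprocess
    import xml.etree.ElementTree as ET

    out = subprocess.run(["partlist"], capture_output=True, text=True).stdout
    root = ET.fromstring("<Partitions>%s</Partitions>" % out)  # wrap in a root
    idle = [p.get("name") for p in root
            if p.get("state") == "idle" and p.get("queue") == "default"]
    print(idle)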

Scheduling

- Partitions are defined for scheduling purposes
  - Includes size, queue, etc.
  - One partition definition per location for user jobs
  - Partitions can overlap, but dependencies need to be defined
- The scheduler can effectively pack jobs onto the machine
  - Greedy backfill is implemented (see the sketch after this list)
  - Reservations
- Per-queue policies
  - default (FIFO + backfill)
  - short queue (< 30 minute jobs)
  - easy to implement more
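To make the greedy backfill idea concrete, here is a minimal sketch. It is not Cobalt's scheduler: it assumes a flat pool of interchangeable nodes rather than BG/L's partition topology, and the job/reservation data structures are invented for illustration.

    # Minimal greedy-backfill sketch (illustrative only; Cobalt's real
    # scheduler works over BG/L partitions, not a flat pool of nodes).
    def schedule(queue, free_nodes, now, running):
        """Return jobids to start now. queue: FIFO list of
        {'jobid','nodes','walltime'}; running: list of {'nodes','end_time'}."""
        if not queue:
            return []
        head = queue[0]
        if head["nodes"] <= free_nodes:
            return [head["jobid"]]  # plain FIFO: head fits, start it
        # Head doesn't fit: find when enough nodes free up for it (the
        # "shadow time"), then greedily backfill later jobs that both fit
        # now and finish before that time, so the head is never delayed.
        avail, shadow = free_nodes, None
        for job in sorted(running, key=lambda j: j["end_time"]):
            avail += job["nodes"]
            if avail >= head["nodes"]:
                shadow = job["end_time"]
                break
        started = []
        for job in queue[1:]:
            fits = job["nodes"] <= free_nodes
            if fits and (shadow is None or now + job["walltime"] <= shadow):
                started.append(job["jobid"])
                free_nodes -= job["nodes"]
        return started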

Dynamic Kernel Selection

- Users set up kernel profiles; each includes CNK, the I/O node kernel, the I/O node ramdisk, and the loader
- Each partition is configured with a partition-specific boot location
- User jobs include a kernel profile; the default profile is "default"
- The partition-specific boot location is a symlink (see the sketch below)
  - Cobalt modifies this link for each job, once the execution location has been established
  - The partition boots the specified kernel upon job startup
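The symlink switch itself is straightforward to express in code. A sketch of the general idea, with invented paths and helper names (Cobalt's actual profile and boot directory layout is not described here):

    # Hypothetical sketch of per-job kernel selection via a symlink.
    # Paths and names are invented for illustration.
    import os

    PROFILES = "/bgl/kernel-profiles"  # one subdirectory per kernel profile

    def point_partition_at_profile(boot_link, profile):
        """Repoint a partition's boot-location symlink at the directory
        holding the requested profile's kernel images."""
        target = os.path.join(PROFILES, profile)
        tmp = boot_link + ".new"
        os.symlink(target, tmp)   # create the new link beside the old one
        os.rename(tmp, boot_link) # rename over the old link atomically

    # Before booting a job's partition:
    # point_partition_at_profile("/bgl/boot/R000_J102-32", "ZeptoOS-1.1")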

Active Development Areas

- Support for user-specified ZeptoOS kernel and ramdisk options
- Direct process manager interface via the Bridge APIs (a replacement for mpirun interactions)
- Python bindings for the Bridge APIs
  - Will allow easy implementation of BG/L-specific management and status scripts
- Scheduler improvements
  - Multi-rack allocation policies
  - More efficient backfill
  - Investigate rule-based scheduling policies
- Explore coordination of ZeptoOS node startup and shutdown features
- Better user interface for system software development
  - Perhaps remove the partition reboot requirement
- Improved installation process

Cobalt Code Status

- In production at Argonne since 08/2003; on BG/L since 02/2005
- In production at NCAR since 05/2005
- Available at EPCC since Tuesday (10/2005)
- Under evaluation at several other sites
- Development process is open
  - Suggestions and patches are both welcome
  - NCAR has helped with both code and documentation improvements
- Available from http://www.mcs.anl.gov/cobalt

Cobalt Results

- Small codebase allows easy modifications
  - Usability improvements
  - New features (~3 minutes is the current record)
  - Site-specific customizations
- Porting to new systems is quite easy
  - Provides a uniform interface for clusters and BG/L systems
- Properly arbitrates between system software developers and computational science users
  - Allows on-the-fly system configuration changes
  - Ensures that computational jobs get non-development versions of system software
  - Able to protect each user group from the other and its software requirements
  - Therapeutic effect on sysadmin blood pressure
- System software small enough to be readily understood and modified

The End

Questions?

http://www.mcs.anl.gov/cobalt