Job Scheduling on a Large UV 1000. Chad Vizino, SGI User Group Conference, May 2011. © 2011 Pittsburgh Supercomputing Center


Overview
- About PSC's UV 1000
- Simon
- UV Distinctives
- UV Operational issues
- Conclusion

PSC's UV 1000 - Blacklight

Blacklight after router installation

Lots of cables

Blacklight Hardware
- Installed September 2010; routers installed December 2010
- 2 x 16TB SSIs
- 128 blades per SSI
- 8GB per core
- 2048 physical cores per SSI
- Dual-socket, 8-way Intel Xeon 2.27GHz (Nehalem)
- 16 physical cores per blade
- 32 Hyper-Threaded cores per blade

Current SCRATCH
- Lustre 1.8.5
- 92TB
- 8 servers
- IB SDR connection via Blacklight
- 2 x DDN 8550
- New deployment coming

New SCRATCH
- Imminent deployment
- Runs drives at 95% of spindle speed
- See Michael Levine's talk on Blacklight

User Environment
- Login node: dual quad-core Intel Xeon 2.4GHz (Westmere), 24GB memory
- Common /usr/users ($HOME), /usr/local (packages managed with modules)

Login node
- Access/edit files
- Compile codes
- Submit/monitor jobs
- Users may not log in to the compute nodes (the UV SSIs); interactive jobs via qsub -I are allowed (see the sketch below)
- Runs the Torque server and scheduler processes
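
A hypothetical interactive session, assuming the blade-granular limits described later in this talk (ncpus must be a multiple of 16):

    # Request one blade's worth of physical cores for 30 minutes;
    # qsub -I opens a shell inside the job's cpuset instead of
    # running a batch script.
    qsub -I -l ncpus=16 -l walltime=00:30:00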

Software
- SUSE Linux Enterprise Server 11.1
- SGI Performance Suite 1.1
- Torque Resource Manager 2.3.13 (with local mods)
- Simon scheduler (locally developed)

About Simon
- Locally developed job scheduler; work started 10 years ago
- Integrated with Torque
- Ported to various architectures:
  - Compaq AlphaServer SC (RMS)
  - Cray XT3 (CPA)
  - SGI Altix 4700 (cpusets)

UV Distinctive #1: Cpusets
- Jobs assigned to whole blades
- Users request ncpus and walltime limits
- Get more memory by requesting more blades
- Memory enforcement: job killed when the cpuset's memory_pressure > 0
- Cpuset is cpu exclusive and mem exclusive
- Lessons learned from Altix 4700 experience
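
A minimal sketch of the enforcement check, assuming the legacy cpuset filesystem is mounted at /dev/cpuset and job cpusets are named by Torque job id (both paths are assumptions, not PSC's actual layout):

    # memory_pressure > 0 means the kernel is reclaiming pages inside
    # this cpuset, i.e. the job has outgrown its blades' memory.
    JOBID=123456.bl0    # hypothetical job id
    if [ "$(cat /dev/cpuset/torque/$JOBID/memory_pressure)" -gt 0 ]; then
        qdel "$JOBID"   # kill the job, as the enforcement described above does
    fi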

Cpusets facilitate repeatable performance

Hard to achieve repeatable performance!

More on Simon
- Written in Tcl, about 4,200 lines of code
- Integration with Torque
- Backfill
- Reservations
- Stuffing control (QOS)
- Co-scheduling of software licenses
- Flexible walltime support

Torque Integration Features
- Linux kernel job integration:
  - Mom calls job_create() with the Torque job id
  - Enables use of ja by users: csacom -j `printf %x <torque_job_id>`
- Limiting process threads (see the example below):
  - Java garbage collection threads: -XX:ParallelGCThreads=N
  - thread_factor set on queue; limit = thread_factor * ncpus
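
Why thread limiting matters: by default the JVM sizes its parallel GC thread pool from the number of visible cpus, which on a 2048-core SSI is far more than any one job's share. A hedged example (MyApp is hypothetical):

    # Cap parallel GC threads at the job's physical core count.
    java -XX:ParallelGCThreads=$PBS_NCPUS MyApp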

UV Distinctive #2: Dealing with Hyper-Threads

Hyper-Threads and Jobs
- Users specify a physical core count: qsub -l ncpus=N
- N must be a multiple of 16
- Job environment provides PBS_NCPUS (N) and PBS_HT_NCPUS (N*2)
- mpirun -np $PBS_NCPUS, or mpirun -np $PBS_HT_NCPUS
- Mom daemon creates the cpuset with the Hyper-Thread cpu count (N*2)
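
A minimal batch-script sketch putting these pieces together (./myapp is hypothetical):

    #!/bin/bash
    #PBS -l ncpus=32              # two blades' worth of physical cores
    #PBS -l walltime=01:00:00
    cd $PBS_O_WORKDIR
    mpirun -np $PBS_NCPUS ./myapp         # one rank per physical core
    # or: mpirun -np $PBS_HT_NCPUS ./myapp   # one rank per hardware thread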

CPU Numbering from topology output

CPU   Blade       PhysID  CoreID  APIC-ID  Family  Model  Speed  L1(KiB)  L2(KiB)  L3(KiB)
-------------------------------------------------------------------------------------------
0     r001i01b00  00      00      0        6       46     2267   32d/32i  256      24576
1     r001i01b00  00      01      2        6       46     2267   32d/32i  256      24576
2     r001i01b00  00      02      4        6       46     2267   32d/32i  256      24576
3     r001i01b00  00      03      6        6       46     2267   32d/32i  256      24576
...
13    r001i01b00  01      09      50       6       46     2267   32d/32i  256      24576
14    r001i01b00  01      10      52       6       46     2267   32d/32i  256      24576
15    r001i01b00  01      11      54       6       46     2267   32d/32i  256      24576
...
2048  r001i01b00  00      00      1        6       46     2267   32d/32i  256      24576
2049  r001i01b00  00      01      3        6       46     2267   32d/32i  256      24576
2050  r001i01b00  00      02      5        6       46     2267   32d/32i  256      24576
2051  r001i01b00  00      03      7        6       46     2267   32d/32i  256      24576
2052  r001i01b00  00      08      17       6       46     2267   32d/32i  256      24576
...
2060  r001i01b00  01      08      49       6       46     2267   32d/32i  256      24576
2061  r001i01b00  01      09      51       6       46     2267   32d/32i  256      24576
2062  r001i01b00  01      10      53       6       46     2267   32d/32i  256      24576
2063  r001i01b00  01      11      55       6       46     2267   32d/32i  256      24576

Blade scheduling

System Hierarchy and Scheduling
- Hierarchy: Rack > IRU > Blade > Memnode > Cpus
- Boot blade (1st blade of each SSI) not scheduled
- IO blades (have IB cards) not scheduled
- Simon maintains a list of free and in-use memory nodes per SSI
- Simon manipulates the nodeset resource

Nodeset Resource
- Used for job placement: Simon places jobs using nodeset
- Format is mems:cpus, e.g. nodeset=2-3:16-31,2064-2079 (decoded below)
- Used by pbs_mom to construct the cpuset on the Blacklight node
- Queues can have a memnode mask to target specific memnodes (blades):
  - Debug jobs on blade 127 (1/2 memory)
  - Also on other nodes with < 128GB (full memory)
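
Decoding the example above against the CPU numbering slide (an illustration, not output from Simon):

    # nodeset=2-3:16-31,2064-2079
    #   mems 2-3        -> the two memory nodes of the SSI's second blade
    #   cpus 16-31      -> that blade's 16 physical cores
    #   cpus 2064-2079  -> their Hyper-Thread siblings (physical cpu + 2048)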

PMM: a text-based monitor

     1 (bl0)                             2 (bl1)                    =partition
     1        2        3        4        5        6        7        8=RACK
IRU  -------- -------- -------- -------- -------- -------- -------- --------
     ******** ******** ******** *******. ******** ******** ******** *******.
     ******** ******** ******** ******** ******** ******** ******** ********
23   -------- -------- -------- -------- -------- -------- -------- --------
     ******** ******** ******** ******** ******** ******** ******** ********
     B*xxx*** ******** ******** ***.**** B*xxx*** ******** ******** ********
01   -------- -------- -------- -------- -------- -------- -------- --------

4567CDEF=HEX BLADE #    Key: *=allocated  B=boot
012389AB                     .=free       x=not scheduled

Blacklight Racks

Blacklight IRUs

Blacklight 3D Monitor: see Blacklight3DMonitor.avi

UV Distinctive #3: Lots of Hardware

Database Holds Static Configuration Data
- SQLite SQL database engine
- Provides one place to get configuration information for both SSIs
- Easy access to topology command output for each SSI
- Integration with Simon planned
- Used by pmm and the Blacklight 3D Monitor

Database Tables (all in under 500 kilobytes!)
- Partitions
- Blades
- Cpus
- Cpusets
- Memnodes
- Devices
- Routers

Partitions Table

sqlite> select * from partitions limit 1;
partition_num = 1
serial = UV-00000071
hostname = bl0.psc.teragrid.org
blades = 128
routers = 96
cpus = 4096
mem_total_gb = 16060.64
io_risers = 5
infiniband_controllers = 6
network_controllers = 2
scsi_controllers = 1
usb_controllers = 8
vga_gpus = 1

Blades Table

sqlite> select * from blades limit 1;
blade_num = 0
partition_num = 1
blade_name = r001i01b00
rack = 1
iru = 1
blade = 0
asic = UVHub 2.0
nasid = 0
cpus = 32
memory_kb = 132077200
configured = 0
comment = boot

Cpusets and Memnodes Tables

sqlite> select * from cpusets limit 1;
cpuset_num = 0
partition_num = 1
cpuset_name = boot
mems = 0-1
cpus = 0-15,2048-2063

sqlite> select * from memnodes limit 1;
memnode_num = 0
blade_num = 0
partition_num = 1
cpuset_num = 0
mem_total_kb = 64968336

Memnodes and Cpus Tables

sqlite> select * from memnodes limit 1;
memnode_num = 0
blade_num = 0
partition_num = 1
cpuset_num = 0
mem_total_kb = 64968336

sqlite> select * from cpus limit 1;
cpu_num = 0
memnode_num = 0
blade_num = 0
partition_num = 1
cpuset_num = 0
physid = 0
coreid = 0
apic_id = 0
family = 6
model = 46
speed = 2266
l1 = 32d/32i
l2 = 256
l3 = 24576

Devices and Routers Tables

sqlite> select * from devices limit 1;
blade_num = 0
partition_num = 1
pci_address = 0000:01:00.0
x_server_display = -
device = Intel 82576 Gigabit Network Connection

sqlite> select * from routers limit 1;
router_num = 0
partition_num = 1
router_name = r001i01r00
rack = 1
upos = 1
router = 0
class = NL5Router

Database Queries
- Facilitate blade name and cpu/memnode translation
- Look up last job use by blade
- Help answer (examples below):
  - What blades did a job use?
  - What memnodes and partition correspond to a given blade name?
  - Which blades have less memory than expected after boot?
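
Two illustrative queries against the schema shown above (the database filename is hypothetical; the threshold comes from the ~128GB-per-blade figure in the Blades table):

    # Which blades have less memory than expected after boot?
    sqlite3 blacklight.db "SELECT partition_num, blade_name, memory_kb
                           FROM blades WHERE memory_kb < 130000000;"

    # What memnodes correspond to a given blade name?
    sqlite3 blacklight.db "SELECT m.memnode_num
                           FROM memnodes m JOIN blades b
                             ON  m.blade_num = b.blade_num
                             AND m.partition_num = b.partition_num
                           WHERE b.blade_name = 'r001i01b00';"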

Operations: Pre-job Scan (prologue script)
- Cpuset coherency at startup
- Tmpfs: RAM-based file system backed by the cpuset's memory
  - /dev/tmpfs/<job_id> directory
  - Created at job start, destroyed at job end (also scan for orphans in the prologue; see the sketch below)
- Lustre check
- Save the job script for future reference
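
A sketch of the tmpfs bookkeeping, assuming Torque's standard prologue calling convention ($1 is the job id); paths and the orphan test are illustrative, not PSC's actual script:

    #!/bin/sh
    JOBID="$1"
    # Reap tmpfs directories orphaned by jobs that no longer exist.
    for d in /dev/tmpfs/*; do
        [ -d "$d" ] || continue
        qstat "$(basename "$d")" >/dev/null 2>&1 || rm -rf "$d"
    done
    # Create this job's RAM-backed scratch directory.
    mkdir -p "/dev/tmpfs/$JOBID"
    # Lustre check: fail the prologue if scratch is unreachable.
    df /scratch >/dev/null 2>&1 || exit 1    # mount point is hypothetical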

Operations: Memory Failures
- Check at boot time via a topology command difference checker
- Watch memlog via the Simple Event Correlator (SEC)
- SEC updates the system db so we can keep track of failures
  - Provides a placeholder so we don't forget about them
  - Remove from the db after the hardware is replaced
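
A boot-time difference check might look like this sketch (snapshot file names are hypothetical):

    # Compare current topology output against a known-good snapshot to
    # spot blades that came up with less memory than expected.
    topology > /tmp/topology.now
    diff /var/db/topology.good /tmp/topology.now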

Future Plans
- Develop database integration
- Predictive walltime scheduling to mitigate long drain times
  - D. Tsafrir, Y. Etsion, and D. G. Feitelson, "Backfilling using system-generated predictions rather than user runtime estimates," IEEE Trans. Parallel & Distributed Syst. 18(6), pp. 789-803, June 2007
- Topology-aware scheduling algorithms

More PSC! Michael Levine is giving the customer keynote on Blacklight, Wednesday at 9:00am.