System Software for High Performance Computing. Joe Izraelevitz




Agenda
Overview of Supercomputers
Blue Gene/Q System
LoadLeveler Job Scheduler
General Parallel File System
HPC at UR

What is a Supercomputer?
Lots of other computers
Closely co-located on a managed network
Backing store
The World's Simplest Supercomputer (Beowulf Cluster): two Linux machines with rsh enabled, connected for IPC

Key Concepts in Supercomputers
Cluster: a grouping of computers
Node: a computer within the cluster
Job: a program instance (a set of processes)

Operating Systems for HPC
Each computer in the cluster has an operating system
Off the shelf: Linux (Red Hat), Windows Server
Specialized: Compute Node Linux, CNK, INK
But the supercomputer can also have an operating system, called the system management software, which manages its component nodes
(Stack: application on system management software on node OS)

Operating Systems for HPC: System Management Software Components
Node Operating System (Linux, CNK, etc.)
Message Passing (MPI, PVM)
Job Scheduler (Maui Scheduler, LoadLeveler)
Resource Manager (Torque Resource Manager, LSF, SLURM)
Backing Store (AFS, DFS, GPFS)
Front End UI
Hardware Architecture

Blue Gene/Q Cluster
IBM flagship supercomputer, third generation
A complete supercomputer: system architecture plus system management software

Blue Gene/Q Architecture
Compute nodes run CNK (Compute Node Kernel) on 17 cores; I/O nodes run INK (I/O Node Kernel)
Networks: IPC network between compute nodes, file I/O network to the I/O nodes
Backing store: GPFS (General Parallel File System)
Front end UI and system management software

Blue Gene System Management Software
Job Scheduler: LoadLeveler
Resource Manager: LoadLeveler Central Manager
IPC: MPICH2
File System: GPFS
OS: CNK, INK

Job Scheduling
Goal: maximize resource usage (CPU cycles, RAM, storage space, software licenses)
Algorithms: SJF, LJF, FIFO, high priority, etc.
Considerations: job type, OS awareness, scalability, efficiency, dynamic capability, preemption, OS scheduling
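To make the algorithm choice concrete, here is a toy sketch (not LoadLeveler code; the job runtimes are invented) comparing FIFO and SJF average wait times on a single shared resource:

```python
def average_wait(runtimes):
    """Average time each job waits before starting, given a run order."""
    wait, elapsed = 0, 0
    for t in runtimes:
        wait += elapsed
        elapsed += t
    return wait / len(runtimes)

jobs = [10, 1, 2]                  # runtimes in minutes, in FIFO arrival order
fifo = average_wait(jobs)          # waits of 0, 10, 11 -> average 7.0
sjf = average_wait(sorted(jobs))   # shortest first: waits 0, 1, 3 -> ~1.33
```

Running the short jobs first cuts average wait dramatically, at the cost of potentially starving long jobs, which is why real schedulers layer priorities and reservations on top.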

Job Scheduler: LoadLeveler
Built-in Blue Gene/Q job scheduler
Checkpoint support
Priority queues: priority from user group; FIFO within jobs of equal priority
Generally nonpreemptible

LoadLeveler: LL_DEFAULT
Double queue with advanced reservation: as nodes are freed, reserve them for the next job
NEGOTIATOR_PARALLEL_HOLD: specifies how long a job can hold onto a reserved resource
Serial programs are queued separately from parallel programs
Issues: underutilization; jobs may never get enough resources within the time allotted

LoadLeveler: BACKFILL
Double queue with advanced reservation and a wall clock limit
The scheduler can determine when resources will be available
Can backfill shorter jobs ahead of large jobs
Issues: priority inversion; incorrect wall clock limits
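The backfill decision can be sketched as a single predicate (a hypothetical simplification, not LoadLeveler's actual logic; the parameter names are invented):

```python
def can_backfill(job_nodes, job_limit, free_nodes, reservation_start):
    """A short job may start now only if it fits on the currently free
    nodes AND its wall clock limit guarantees it finishes before the
    reservation for the next big job begins."""
    return job_nodes <= free_nodes and job_limit <= reservation_start

# Example: 4 nodes sit idle until a big job's reservation starts at t=30.
fits = can_backfill(job_nodes=2, job_limit=20, free_nodes=4, reservation_start=30)
too_long = can_backfill(job_nodes=2, job_limit=45, free_nodes=4, reservation_start=30)
```

This is also why an incorrect wall clock limit is listed as an issue: if `job_limit` is an underestimate, the backfilled job can overrun into the big job's reservation.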

LoadLeveler: GANG
Coordinated time-multiplexed scheduler: each time slice runs as a virtual machine
Issues: increased run time, context switch overhead, limited by RAM
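A minimal sketch of the round-robin time multiplexing idea (invented job names, not LoadLeveler internals):

```python
def gang_schedule(jobs, num_slices):
    """Coordinated time multiplexing: in each time slice one job owns
    the whole machine, so each slice behaves like a separate virtual
    machine for that job."""
    return [jobs[s % len(jobs)] for s in range(num_slices)]

# Three jobs sharing the machine over six slices:
slices = gang_schedule(["A", "B", "C"], 6)
```

Every job makes progress, but each one's wall clock run time stretches by roughly the number of co-scheduled jobs, and all of their working sets must fit in RAM at once.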

General Parallel File System (GPFS)
Blue Gene/Q default file system
Parallel access to files and file metadata
Design considerations: highly parallel access, bandwidth bottlenecks, huge disks and files
(Architecture: compute nodes connect through I/O nodes over the I/O network to the disk array)

GPFS Overview
Striped files: stored in ~256K blocks, distributed across disks in round-robin fashion
Massively parallel file retrieval, limited by bandwidth
Vulnerable to failure, so RAID redundancy on each disk
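Round-robin striping makes block placement a pure arithmetic mapping. A sketch (block size taken from the slide; the function is illustrative, not GPFS's actual allocator):

```python
BLOCK_SIZE = 256 * 1024  # ~256K blocks, per the slide

def block_location(offset, num_disks):
    """Map a byte offset to (disk index, block slot on that disk)
    under round-robin striping."""
    block = offset // BLOCK_SIZE
    return block % num_disks, block // num_disks

# With 4 disks, consecutive blocks land on disks 0, 1, 2, 3, 0, 1, ...
disks = [block_location(i * BLOCK_SIZE, 4)[0] for i in range(6)]
```

Because consecutive blocks live on different disks, a large sequential read can be serviced by all disks at once, which is exactly why throughput becomes bandwidth-limited rather than disk-limited.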

GPFS: Read/Write
File parallelism by two methods:
Distributed lock manager: locks on byte ranges within a file; lock tokens issued to I/O nodes
Data shipping: RCU-managed blocks within a single file
Metadata parallelism: one I/O node is designated as the metanode for a file and maintains its inode information
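The heart of byte-range locking is an overlap test: writers to disjoint ranges of the same file can proceed in parallel, and only overlapping requests force a token revocation. A toy check (illustrative, not the GPFS token protocol):

```python
def ranges_conflict(a, b):
    """Two byte-range locks, each (start, end) with end exclusive,
    conflict exactly when the ranges overlap."""
    return a[0] < b[1] and b[0] < a[1]

# Node 1 holds bytes [0, 4096); node 2 asks for [4096, 8192):
disjoint = ranges_conflict((0, 4096), (4096, 8192))   # no conflict
overlap = ranges_conflict((0, 4096), (1024, 2048))    # conflict
```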

GPFS: Allocate/Delete
Allocation manager: maintains a bitmap of free blocks and issues region locks
File allocation: get a region lock, check for free space
File deletion: requires updating the allocation manager and clearing disk space while holding the region lock
Deletion is delayed and distributed across I/O nodes based on lock ownership

GPFS: Disk Organization
Extensible hashing within directories: use n bits of the hash function to group files; on collision, increase to n+1 bits and reorganize
Journaling file system on disk: shared journal, so any node can restore the disk
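A sketch of the extensible-hashing idea (illustrative only; GPFS's actual directory hash differs, and the file name is invented):

```python
def bucket_for(name, n_bits):
    """Pick a directory bucket from the low n bits of the name's hash;
    on overflow the directory doubles (n -> n+1) and entries rehash."""
    return hash(name) & ((1 << n_bits) - 1)

# Growing from n to n+1 bits splits each bucket in two: an entry in
# bucket b moves to either b or b + 2**n, so only the overflowing
# bucket's entries need to be reorganized.
name = "results.dat"
b3 = bucket_for(name, 3)
b4 = bucket_for(name, 4)
```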

Message Passing (MPI)
MPI (Message Passing Interface) is a standard, not a library
Implementations with compliant compilers: OpenMPI, MPICH, mpijava, pympi, etc.
"Superfork" to all available CPUs: a master process spawns processes, each of which calls MPI_INIT(), then MPI_SEND(), MPI_RECV(), MPI_WAIT()
(Mostly) OS- and cluster-manager independent
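The SPMD pattern on the slide can be illustrated with a toy single-process simulation (NOT real MPI; queues stand in for the interconnect, and the message text is invented):

```python
from queue import Queue

def run_rank(rank, mailboxes):
    """Every rank runs the same program and branches on its rank,
    mirroring the MPI SPMD model."""
    if rank == 0:
        mailboxes[1].put("hello from rank 0")   # ~ MPI_SEND to rank 1
    elif rank == 1:
        return mailboxes[1].get()               # ~ MPI_RECV (waits)

mailboxes = [Queue(), Queue()]
run_rank(0, mailboxes)
msg = run_rank(1, mailboxes)
```

In real MPI each rank is a separate OS process (often on a separate node) launched by the resource manager, and the send/receive pair crosses the IPC network instead of an in-memory queue.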

Resource Manager
Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs; without these capabilities, a scheduler alone cannot control jobs
A daemon runs on each node on top of the OS
A layer of abstraction between the OS and the message passing interface
Interfaces with the job scheduler
Manages job submission and the admin interface
Monitors compute resources

HPC at U of R
Blue Streak: Blue Gene/Q system, SLURM
BG/P: Blue Gene/P system, LoadLeveler resource manager/scheduler
BlueHive: Intel Blade Center system, Torque Resource Manager, Maui Scheduler

Works Cited
Barney, Blaise. Message Passing Interface (MPI). Lawrence Livermore National Laboratory. https://computing.llnl.gov/tutorials/mpi/#abstract (2012).
Center for Integrated Research Computing. Resources. University of Rochester. http://www.circ.rochester.edu/resources.html (2012).
Gilge, Megan. IBM System Blue Gene Solution: Blue Gene/Q Application Development. International Technical Support Organization, IBM, March 2012. http://www.redbooks.ibm.com/redpieces/pdfs/sg247948.pdf
Iqbal, Saeed, Rinku Gupta, and Yung-Chin Fang. Planning Considerations for Job Scheduling in HPC Clusters. Dell Power Solutions, February 2005. http://www.dell.com/downloads/global/power/ps1q05-20040135-fang.pdf
Lakner, Gary, and Brant Knudson. IBM System Blue Gene Solution: Blue Gene/Q System Administration. International Technical Support Organization, IBM, June 2012. http://www.redbooks.ibm.com/redbooks/pdfs/sg247869.pdf
Kannan, Subramanian, Mark Roberts, Peter Mayes, Dave Brelsford, and Joseph F. Skovira. Workload Management with LoadLeveler. International Technical Support Organization, IBM, November 2001. http://www.redbooks.ibm.com/redbooks/pdfs/sg246038.pdf
Schmuck, Frank, and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. Proceedings of the Conference on File and Storage Technologies (FAST '02), 28-30 January 2002, Monterey, CA, pp. 231-244. USENIX, Berkeley, CA.