Mitglied der Helmholtz-Gemeinschaft. System monitoring with LLview and the Parallel Tools Platform

Similar documents

Report on Project: Advanced System Monitoring for the Parallel Tools Platform (PTP)

A highly configurable and efficient simulator for job schedulers on supercomputers

Scalable System Monitoring

Mitglied der Helmholtz-Gemeinschaft UNICORE. Uniform Access to JSC Resources. Michael Rambadt, 20.

A High Performance Computing Scheduling and Resource Management Primer

LSKA 2010 Survey Report Job Scheduler

Microsoft HPC. V 1.0 José M. Cámara (checam@ubu.es)

Welcome to the. Jülich Supercomputing Centre. D. Rohe and N. Attig Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services

Developing Parallel Applications with the Eclipse Parallel Tools Platform

Batch Systems. provide a mechanism for submitting, launching, and tracking jobs on a shared resource

Broadening Moab/TORQUE for Expanding User Needs

JUROPA Linux Cluster An Overview. 19 May 2014 Ulrich Detert

Anwendungsintegration und Workflows mit UNICORE 6

Chapter 2: Getting Started

Cloud Computing Architecture with OpenNebula HPC Cloud Use Cases

Distributed Computing and Big Data: Hadoop and MapReduce

Final Report. Cluster Scheduling. Submitted by: Priti Lohani

HPC Workload Management Tools: A Competitive Benchmark Study

GC3: Grid Computing Competence Center Cluster computing, I Batch-queueing systems

Parallel Computing using MATLAB Distributed Compute Server ZORRO HPC

Anar Manafov, GSI Darmstadt. GSI Palaver,

System Software for High Performance Computing. Joe Izraelevitz

Managing Multiple Multi-user PC Clusters Al Geist and Jens Schwidder * Oak Ridge National Laboratory

Technical Specification. Solutions created by knowledge and needs

The Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland

Private Public Partnership Project (PPP) Large-scale Integrated Project (IP)

Grid Scheduling Dictionary of Terms and Keywords

Discovering the Petascale User Experience in Scheduling Diverse Scientific Applications: Initial Efforts towards Resource Simulation

UFTP High-performance data transfer for UNICORE

The CMS analysis chain in a distributed environment

:Introducing Star-P. The Open Platform for Parallel Application Development. Yoel Jacobsen E&M Computing LTD

Job Scheduling with Moab Cluster Suite

Ten Reasons to Switch from Maui Cluster Scheduler to Moab HPC Suite Comparison Brief

Parallel Visualization of Petascale Simulation Results from GROMACS, NAMD and CP2K on IBM Blue Gene/P using VisIt Visualization Toolkit

Overview of HPC Resources at Vanderbilt

New Issues and New Capabilities in HPC Scheduling with the Maui Scheduler

Microsoft Enterprise Search for IT Professionals Course 10802A; 3 Days, Instructor-led

Open-source and Standards - Unleashing the Potential for Innovation of Cloud Computing

MPI / ClusterTools Update and Plans

GridWay: Open Source Meta-scheduling Technology for Grid Computing

Interoperability between Sun Grid Engine and the Windows Compute Cluster

Sun Grid Engine, a new scheduler for EGEE

Monitoring Infrastructure for Superclusters: Experiences at MareNostrum

Introduction to Arvados. A Curoverse White Paper

Maintaining Non-Stop Services with Multi Layer Monitoring

ANSYS Remote Solve Manager User's Guide

Program Grid and HPC5+ workshop

CoolEmAll - Tools for realising an energy efficient data centre

Chapter -5 SCALABILITY AND AVAILABILITY

Monitoring Nginx Server

Work Environment. David Tur HPC Expert. HPC Users Training September, 18th 2015

Optimizing the Performance of Your Longview Application

Development of parallel codes using PL-Grid infrastructure.

Juropa. Batch Usage Introduction. May 2014 Chrysovalantis Paschoulas

The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - -

Output Agent Throughput

Using NeSI HPC Resources. NeSI Computational Science Team

Provisioning and Resource Management at Large Scale (Kadeploy and OAR)

HPC Cloud Computing with OpenNebula

Monitoring Clusters and Grids

Software-defined Storage Architecture for Analytics Computing

Tool - 1: Health Center

Using DeployR to Solve the R Integration Problem

Capacity Planning for Microsoft SharePoint Technologies

Key Requirements for a Job Scheduling and Workload Automation Solution

System Models for Distributed and Cloud Computing

An Oracle White Paper September Advanced Java Diagnostics and Monitoring Without Performance Overhead

Sun in HPC. Update for IDC HPC User Forum Tucson, AZ, Sept 2008

Current Status of FEFS for the K computer

Gratia: New Challenges in Grid Accounting.

WM Technical Evolution Group Report

Batch Job Analysis to Improve the Success Rate in HPC

Lab Management, Device Provisioning and Test Automation Software

WissGrid. JHOVE2 over the Grid 1. Dokumentation. Arbeitspaket 3: Langzeitarchivierung von Forschungsdaten. Änderungen

E U F O R I A D S A 3. 1

Easier - Faster - Better

Toad for Oracle 8.6 SQL Tuning

Grid Scheduling Architectures with Globus GridWay and Sun Grid Engine

Running a Workflow on a PowerCenter Grid

MEETING THE CHALLENGES OF COMPLEXITY AND SCALE FOR MANUFACTURING WORKFLOWS

File transfer in UNICORE State of the art

White Paper. Optimizing the Performance Of MySQL Cluster

Hadoop Cluster Applications

Network Contention Aware HPC job scheduling with Workload Precongnition

GT 6.0 GRAM5 Key Concepts

PBS Professional Job Scheduler at TCS: Six Sigma- Level Delivery Process and Its Features

Realization of Inventory Databases and Object-Relational Mapping for the Common Information Model

IBM Platform Computing : infrastructure management for HPC solutions on OpenPOWER Jing Li, Software Development Manager IBM

fpafi/tl enterprise Microsoft Silverlight 5 and Windows Azure Enterprise Integration Silverlight Enterprise Applications on the Windows

The Ultimate in Scale-Out Storage for HPC and Big Data

Using the Windows Cluster

The Information Technology Solution. Denis Foueillassar TELEDOS project coordinator

SAP Data Services 4.X. An Enterprise Information management Solution

Online Transaction Processing in SQL Server 2008

Cluster, Grid, Cloud Concepts

SLURM Workload Manager

The Seven Keys to Successfully Managing Compute Workloads at Financial Institutions

Transcription:

Mitglied der Helmholtz-Gemeinschaft System monitoring with LLview and the Parallel Tools Platform November 25, 2014 Carsten Karbach

Content 1 LLview 2 Parallel Tools Platform (PTP) 3 Latest features 4 Future Development November 25, 2014 Carsten Karbach 2 26

Mitglied der Helmholtz-Gemeinschaft Part I: LLview November 25, 2014 Carsten Karbach

1. LLview Why system monitoring? For administrators Global overview of system utilization Throughput optimization Batch system configuration optimization Adaptive change of scheduling parameters For users Controlling own running and waiting jobs Planning job submissions Use of idling resources LLview Compact display of all usage data in one window Easy access to system s status data Interactive display for linking information November 25, 2014 Carsten Karbach 4 26

1. LLview LLview Visualizes supercomputer status on a single screen Source: Screenshot LLview for JUQUEEN (5.9 PFlops, 458K cores, LoadLeveler) November 25, 2014 Carsten Karbach 5 26

1. LLview LLview Example: JUROPA Source: Screenshot LLview for JUROPA (207 TFlops, 26K cores, Moab/Torque) November 25, 2014 Carsten Karbach 6 26

1. LLview Monitoring Architecture I Supercomputer Client llview-client update request LML data gatherer LML_da LML response PTP November 25, 2014 Carsten Karbach 7 26

1. LLview Monitoring Architecture II LML da gathers status information, calls target system s remote commands, written in Perl Automatic deploy of LML da, no installation required LML is a data format for status information of supercomputers LML request: requested data and layout LML response: contains the request and status information LML as abstraction layer thin clients, re-use of LML da functions November 25, 2014 Carsten Karbach 8 26

1. LLview LLview Architecture Details Client-Server architecture, LML da as backend Wide range of supported batch systems, minimal effort for extension Minor performance impact on monitored system, only central batch system is queried November 25, 2014 Carsten Karbach 9 26

Mitglied der Helmholtz-Gemeinschaft Part II: PTP November 25, 2014 Carsten Karbach

2. PTP PTP Parallel Tools Platform What is PTP? IDE for parallel application development Based on Eclipse Open-source project Developers: IBM, U.Oregon, UTK, Heidelberg University, NCSA, UIUC, JSC,... November 25, 2014 Carsten Karbach 11 26

2. PTP PTP with LLview Main components of LLview in PTP s monitoring perspective Monitoring example: JUQUEEN November 25, 2014 Carsten Karbach 12 26

2. PTP PTP monitoring features Support for many target systems: Loadleveler, Torque, PBS, Grid Engine, SLURM, LSF Authentication and communication via SSH Client-server architecture LML (large-scale system markup language) as communication language Integrated into PTP workflow Job monitoring Job cancellation View job output data Automatic deployment of LML da November 25, 2014 Carsten Karbach 13 26

2. PTP Scalability 1 Scalable server scripts LML da Select only required data 2 Scalable data format LML Data compression Redundancy avoidance Uses system hierarchy 3 Scalable visualization PTP Filter status data Levels of detail Zoom function November 25, 2014 Carsten Karbach 14 26

2. PTP Monitoring example: JUROPA November 25, 2014 Carsten Karbach 15 26

2. PTP Monitoring example: JUDGE November 25, 2014 Carsten Karbach 16 26

2. PTP Development history 2009 Start of 3-years project on A Scalable Development Environment for Peta-Scale Computing 2010 Design of LML and integration into LLview 2011 First PTP release including system monitoring 2013 Collaboration with NCSA and IBM within Seed Fund Project Advanced System Monitoring for PTP November 25, 2014 Carsten Karbach 17 26

Mitglied der Helmholtz-Gemeinschaft Part III: Latest features November 25, 2014 Carsten Karbach

3. Latest features Node level metrics Add compute node attribute visualization Enrich LML data with metrics like temperature, memory/power usage and load Use nodes display and color map for visualization November 25, 2014 Carsten Karbach 19 26

3. Latest features Node level metrics Add compute node attribute visualization Enrich LML data with metrics like temperature, memory/power usage and load Use nodes display and color map for visualization November 25, 2014 Carsten Karbach 19 26

3. Latest features Server Caching multiple users on the same target system cache LML file in public directory (e.g. /tmp), use LML cache as data source default: each client triggers independent status data update caching: daemon retrieves status data, clients use cached data Cache workflow November 25, 2014 Carsten Karbach 20 26

3. Latest features Advantages of caching Faster response time System Cache [s] No Cache [s] JUDGE 1.1 2.8 JUROPA 5.8 11.1 JUQUEEN 1.7 34.4 Recording of history data is possible Enhancement with data not accessible to normal user Decreased load for the system November 25, 2014 Carsten Karbach 21 26

Mitglied der Helmholtz-Gemeinschaft Part IV: Future development November 25, 2014 Carsten Karbach

4. Future development Customized LML layouts understand system architecture and hierarchy map topology into LML layout advantages: level of detail, automatic job filtering, display node names, improved performance workplan: tutorial on LML layouts, contact system administrators of partner XSEDE/PRACE sites, support writing of LML layouts, ask for feedback November 25, 2014 Carsten Karbach 23 26

4. Future development LLview diagrams histograms prediction load history Challenges Re-implementation in Java? Double update for LLview and PTP November 25, 2014 Carsten Karbach 24 26

4. Future development First implementation step November 25, 2014 Carsten Karbach 25 26

4. Future development Conclusion Monitoring is vital for production of HPC systems LLview visualizes system status in a single screen PTP integrates monitoring into development workflow Future development will optimize PTP s monitoring workflow for large scale systems November 25, 2014 Carsten Karbach 26 26

Proven scalability PTP/LLview were successfully tested on: System Batch system Cores Peak #Jobs JUQUEEN LoadLeveler 458K 5.9 PF 100-1000 Jaguar Torque/Moab 224K 1.8 PF Mogon LSF 34K 204 TF -15000 JUROPA Torque/Moab 26K 207 TF 200-2000 Lonestar SGE 22K 311 TF November 25, 2014 Carsten Karbach 27 26

Mitglied der Helmholtz-Gemeinschaft Part I: Examples for customized layouts November 25, 2014 Carsten Karbach

1. Examples for customized layouts Example JUROPA Customized layout four level hierarchy: partition, rack, node, core level of detail (only with layout) better overview November 25, 2014 Carsten Karbach 29 26

1. Examples for customized layouts Example JUQUEEN Default layout November 25, 2014 Carsten Karbach 30 26

1. Examples for customized layouts Example JUQUEEN Customized layout colored rows do not display rack names November 25, 2014 Carsten Karbach 31 26

Mitglied der Helmholtz-Gemeinschaft Part II: JuFo November 25, 2014 Carsten Karbach

2. JuFo Overview Configurable simulator for global job schedulers for on-line prediction of job dispatch dates Based on analysis of JSC batch systems Moab and Loadleveler Integrated with monitoring system LLview LML as configuration and communication data format Use-cases: User predicts start dates of submitted jobs Administrator simulates job scheduler performance with various input parameters resources Resource limit Job 2 Job 1 BW 1 BW 2 Res 1 BW 3 Res 2 current position time [s] November 25, 2014 Carsten Karbach 33 26

2. JuFo Architecture Supercomputer step 1 step 2 Raw- LML scheduler simulator Client llview-client data gatherer LML_da Raw- LML extends JuFo PTP step 3 LML November 25, 2014 Carsten Karbach 34 26

2. JuFo Features Supported scheduling algorithms First-Come-First-Served List-Scheduling Backfilling Available simulation parameters Generic job prioritization Advanced reservations Jobs can request CPUs, GPUs, memory Nodesharing Queue constraints Test framework for evaluating JuFo s accuracy November 25, 2014 Carsten Karbach 35 26