HappyFace for CMS Tier-1 local job monitoring




G. Quast, A. Scheurer, M. Zvada
Institut für Experimentelle Kernphysik, KIT
CMS Offline & Computing Week, CERN, April 4-8, 2011

KIT: University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association (www.kit.edu)

CMS T1 local job monitoring

Initiative: CMS encourages sites to provide local monitoring information for more reliable job tracking.
- All the T1s have their own monitoring tools
  - the content is tool-specific
  - some tools create XML, but the format varies
- A uniform view of Tier-1 local job monitoring is missing (this applies equally to Tier-2s); that is exactly what this project covers
- The idea is simple: define a common format (XML) for local job monitoring
- More details about the plans for Tier-1 monitoring and the common XML (effort started by Andrea Sciaba): https://twiki.cern.ch/bin/viewauth/cms/cmstier1monitoringproject

Towards a uniform XML view

- Define a common XML format: https://twiki.cern.ch/twiki/bin/viewauth/cms/cmsfarmmonitoringcommonxmlformat
- Define the DB backend and presentation layer; this is where the HappyFace project comes in: https://ekptrac.physik.uni-karlsruhe.de/trac/happyface/

Common idea:
- T1s publish monitoring information in the common format (in addition to their original monitoring)
- Develop XML producers for PBS, Condor, and LSF (thanks Subir!, also Andrew Lahiff of RAL) and provide them to the T1s so that they can publish the information
- Use a HappyFace module to read the XML, visualize the information, and store it in its database (HF developers @KIT)
- Data generation and visualization are decoupled
- Information for all the T1s is integrated in a single page
- Warn if there is a significant number of inefficient jobs in a group

The HappyFace Project

Meta-monitoring suite:
- Allows quasi real-time site monitoring
- Acquires information automatically, not on demand
- Can be used as a sophisticated and modular shift tool
- Also provides detailed information for admins (if required)
- Auto-refresh system, no user intervention necessary
- Provides a rating system (keyword: non-expert shift crews)
- Allows information to be correlated
- Can trigger automatic alarms/notifications
- Highly configurable and adjustable, e.g. certificate-based access control

It is not:
- A plain, stand-alone monitoring tool
- A competitor to Dashboard

Technical Background

HF is:
- Written in Python
- Highly modular
- Built on a DB backend
- Equipped with intrinsic history functionality
- Lightweight, fast and reliable
- In production use for more than two years

HF core module development and maintenance:
- KIT (HF core)
- DESY/Uni of Hamburg
- RWTH Aachen
- University of Göttingen

Easy deployment on new sites!

The HappyFace Web Front-End

Common XML format for T1 jobs monitoring

    <?xml version="1.0" encoding="UTF-8"?>
    <jobinfo>
      <header>
        <batch>pbs</batch>
        <date>1284463501</date>
        <site>t1_de_kit</site>
      </header>
      [...]
      <additional>
        [...]
      </additional>
    </jobinfo>

- <?xml ...?>: XML declaration
- <jobinfo>: opening of the root element
- header: provides metadata: the batch system used, the site at which the jobs run, and the timestamp of the generation of the XML
- [...]: main content, see next slides
- additional: any additional data that might be interesting for a given batch system; this is not processed any further by tools such as HF
- </jobinfo>: document end
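As an illustration of how a consumer might read the header of such a document, here is a minimal Python sketch using only the standard library. It is not the HappyFace module itself: the function name read_header and the content of the <additional> element are made up for this example, and the sketch assumes that <date> holds a Unix timestamp in seconds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Sample document following the structure on the slide; the content of
# <additional> is a made-up placeholder.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<jobinfo>
  <header>
    <batch>pbs</batch>
    <date>1284463501</date>
    <site>t1_de_kit</site>
  </header>
  <additional>
    <note>batch-specific extras, ignored by consumers</note>
  </additional>
</jobinfo>
"""


def read_header(xml_bytes):
    """Return batch system, site and generation time of a jobinfo document."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "jobinfo":
        raise ValueError("unexpected root element: %r" % root.tag)
    header = root.find("header")
    if header is None:
        raise ValueError("missing <header> element")
    return {
        "batch": header.findtext("batch", "").strip(),
        "site": header.findtext("site", "").strip(),
        # Assumption: <date> is a Unix timestamp in seconds.
        "date": datetime.fromtimestamp(int(header.findtext("date")),
                                       tz=timezone.utc),
    }


header = read_header(SAMPLE)
print(header["batch"], header["site"], header["date"].isoformat())
```

The <additional> element is deliberately never touched, mirroring the statement that consumers such as HF do not process it.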

Job summaries

    <summaries>
      <summary group="cms" parent="all">
        <jobs>2345</jobs>
        <running>1324</running>
        <pending>590</pending>
        <held>431</held>
        <ratio10>73</ratio10>
        <cpueff>81.92</cpueff>
        <cputime>18628651</cputime>
        <walltime>22741071</walltime>
      </summary>
      <!-- more summaries of other groups -->
    </summaries>

summaries contains summary elements, each of which gives the job summary of a certain group. The parent attribute allows for a group hierarchy; all jobs in a given group also belong to its parent group. The special group all contains all jobs.

- jobs: total number of jobs in the group
- running: number of running jobs
- pending: number of queued jobs
- held: number of held jobs
- ratio10: number of jobs with cputime over walltime below 10%
- cpueff: cputime over walltime (in percent)
- cputime: total CPU time of all jobs in the group
- walltime: total walltime of all jobs in the group
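The relation between the summary fields can be checked with a short Python sketch. It assumes, based on the example values, that cpueff is cputime over walltime expressed in percent; the helper names read_summaries and recomputed_cpueff are hypothetical and not part of HappyFace.

```python
import xml.etree.ElementTree as ET

# The summary from the slide, values unchanged.
SAMPLE = b"""<summaries>
  <summary group="cms" parent="all">
    <jobs>2345</jobs>
    <running>1324</running>
    <pending>590</pending>
    <held>431</held>
    <ratio10>73</ratio10>
    <cpueff>81.92</cpueff>
    <cputime>18628651</cputime>
    <walltime>22741071</walltime>
  </summary>
</summaries>
"""

INT_FIELDS = ("jobs", "running", "pending", "held",
              "ratio10", "cputime", "walltime")


def read_summaries(xml_bytes):
    """Map each group name to a dict of its summary values."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for node in root.iter("summary"):
        entry = {"parent": node.get("parent")}
        for field in INT_FIELDS:
            entry[field] = int(node.findtext(field, "0"))
        entry["cpueff"] = float(node.findtext("cpueff", "0"))
        result[node.get("group")] = entry
    return result


def recomputed_cpueff(entry):
    """CPU efficiency in percent, derived from cputime and walltime."""
    if entry["walltime"] == 0:
        return 0.0
    return 100.0 * entry["cputime"] / entry["walltime"]


summaries = read_summaries(SAMPLE)
cms = summaries["cms"]
print(cms["cpueff"], round(recomputed_cpueff(cms), 2))
```

For the slide's numbers the recomputed value agrees with the published cpueff to two decimals, and running, pending and held add up to jobs.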

Individual jobs information

    <jobs>
      <job group="cmsprod">
        <id>osg-gw-[...]</id>
        <state>running</state>
        <user>cmsprod</user>
        <group>cmsprod</group>
        <created>1284428116</created>
        <start>1284445509</start>
        <end>n/a</end>
        <queue>cmsprod</queue>
        <cputime>17864</cputime>
        <walltime>19190</walltime>
        <cpueff>93.09</cpueff>
        <cpupercent>94.31</cpupercent>
        <exec_host>cabinet-1-1-3.t2.ucsd.edu</exec_host>
        <ce>osg-gw-4.t2.ucsd.edu</ce>
      </job>
      <!-- more jobs -->
    </jobs>

jobs contains job elements, each of which gives information about one particular job:

- id: ID of the job as given by the batch system
- state: one of running, pending, held, exited, waiting, suspended
- user: user running the job
- group: group of the job
- created: timestamp of submission
- start: timestamp of job start
- end: timestamp of job end (currently always n/a)
- queue: queue of the job
- cputime: total CPU time used
- walltime: total walltime used
- cpueff: cputime over walltime (in percent)
- cpupercent: current CPU usage
- exec_host: name of the worker node
- ce: submitting compute element
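The per-job records are enough to derive a ratio10-style count. The following Python sketch counts, per group, the jobs whose cputime over walltime is below 10%. Two points are assumptions rather than facts from the slides: only running jobs with non-zero walltime are considered, and the second job in the sample (id example-job-2) is invented purely to have one inefficient job to count.

```python
import xml.etree.ElementTree as ET

# First job: values from the slide. Second job: invented for illustration.
SAMPLE = b"""<jobs>
  <job group="cmsprod">
    <id>osg-gw-[...]</id>
    <state>running</state>
    <cputime>17864</cputime>
    <walltime>19190</walltime>
  </job>
  <job group="cmsprod">
    <id>example-job-2</id>
    <state>running</state>
    <cputime>120</cputime>
    <walltime>3600</walltime>
  </job>
</jobs>
"""


def count_inefficient(xml_bytes, threshold=0.10):
    """Count running jobs per group with cputime/walltime below threshold."""
    root = ET.fromstring(xml_bytes)
    counts = {}
    for job in root.iter("job"):
        # Assumption: only running jobs are rated.
        if job.findtext("state", "").strip() != "running":
            continue
        walltime = float(job.findtext("walltime", "0"))
        cputime = float(job.findtext("cputime", "0"))
        if walltime <= 0:
            continue  # job has not accumulated walltime yet
        group = job.get("group")
        counts.setdefault(group, 0)
        if cputime / walltime < threshold:
            counts[group] += 1
    return counts


inefficient = count_inefficient(SAMPLE)
print(inefficient)
```

The job from the slide has an efficiency of about 93% and is not counted; only the invented job falls below the threshold.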

HappyFace module for jobs statistics (I)

CMS to use HappyFace for production job monitoring of CMS jobs at all T1s: http://www-ekp.physik.uni-karlsruhe.de/~happyface/cmst1prodmon/webpage/

- Each T1 sets up an XML producer for its batch system
- HappyFace aggregates the information and visualizes it all in one place
- One category for each site
- T2 jobs can also be shown in another category
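To make the producer side concrete, here is a minimal Python sketch that turns a list of job records into a document with a header and one summary for the group all. It is not one of the actual PBS, Condor or LSF producers: the input dictionaries, the function names, and the choice to count only running jobs for ratio10 are assumptions for this example, and querying the batch system is left out entirely.

```python
import time
import xml.etree.ElementTree as ET


def summarize(jobs):
    """Compute the summary fields for a list of job dicts."""
    running = [j for j in jobs if j["state"] == "running"]
    cputime = sum(j["cputime"] for j in jobs)
    walltime = sum(j["walltime"] for j in jobs)
    # Assumption: ratio10 counts running jobs below 10% CPU efficiency.
    ratio10 = sum(1 for j in running
                  if j["walltime"] > 0
                  and j["cputime"] / j["walltime"] < 0.10)
    cpueff = 100.0 * cputime / walltime if walltime else 0.0
    return {
        "jobs": len(jobs),
        "running": len(running),
        "pending": sum(1 for j in jobs if j["state"] == "pending"),
        "held": sum(1 for j in jobs if j["state"] == "held"),
        "ratio10": ratio10,
        "cpueff": "%.2f" % cpueff,
        "cputime": cputime,
        "walltime": walltime,
    }


def build_jobinfo(batch, site, jobs, timestamp=None):
    """Serialize header and the summary of group 'all' as UTF-8 bytes."""
    root = ET.Element("jobinfo")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "batch").text = batch
    stamp = int(time.time() if timestamp is None else timestamp)
    ET.SubElement(header, "date").text = str(stamp)
    ET.SubElement(header, "site").text = site
    summaries = ET.SubElement(root, "summaries")
    summary = ET.SubElement(summaries, "summary", group="all")
    for tag, value in summarize(jobs).items():
        ET.SubElement(summary, tag).text = str(value)
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")


example_jobs = [
    {"state": "running", "cputime": 17864, "walltime": 19190},
    {"state": "pending", "cputime": 0, "walltime": 0},
]
document = build_jobinfo("pbs", "t1_de_kit", example_jobs,
                         timestamp=1284463501)
print(document.decode("utf-8"))
```

Because the producer only writes a file in the agreed format, the site's batch system and the HappyFace instance that reads the file stay independent of each other.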

HappyFace module for jobs statistics (II)

- Group hierarchy
- Module rating based on inefficient jobs in the cms group
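A rating based on inefficient jobs could look like the Python sketch below. The slides do not give the actual rule, so everything specific here is an assumption: the status values (1.0 for fine, 0.5 for warning, 0.0 for critical), the two thresholds, and the function name rate_group are hypothetical and chosen only for illustration.

```python
def rate_group(running, ratio10, warning_fraction=0.10, critical_fraction=0.30):
    """Return a status for a group from its share of inefficient jobs.

    Assumed convention: 1.0 = fine, 0.5 = warning, 0.0 = critical.
    The thresholds are illustrative, not the values used in production.
    """
    if running <= 0:
        return 1.0  # nothing is running, so there is nothing to complain about
    fraction = ratio10 / running
    if fraction >= critical_fraction:
        return 0.0
    if fraction >= warning_fraction:
        return 0.5
    return 1.0


# Values of the cms group from the summary example: 73 of 1324 running jobs.
status = rate_group(running=1324, ratio10=73)
print(status)
```

With the example summary, about 5.5% of the running jobs are inefficient, which stays below the assumed warning threshold.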

History Plots

The history of each module's status and, depending on the module, other numerical variables can be plotted.

Here: number of jobs at GridKa
- All jobs
- Running jobs
- Jobs with a cputime-to-walltime ratio below 10%

History navigation interval feature: plots are available through a static link, which is convenient to place on TWiki pages, etc.

More features: HF access control

- Certificate-based access control
- Access can be restricted for single modules or whole categories
- Hidden mode
- Use one single HF instance for admins and users!

More features: extended functionality

HF publishes its current status via XML:
- Used as input for a plugin available for the Firefox web browser status bar
- Used as input for smartphone apps (e.g. iPhone, Android)
- Used for Meta²-monitoring, ideal for central shifts: the HF matrix

Current HF development: Database Backend

Roadmap and summary

- The service is to be deployed and maintained at CERN by FacOps
- The HF core code is supported by KIT

Thank you!

- HappyFace project page: https://ekptrac.physik.uni-karlsruhe.de/trac/happyface/
- HappyFace view of uniform CMS local job monitoring: http://www-ekp.physik.uni-karlsruhe.de/~happyface/cmst1prodmon/webpage/
- Documentation for the common XML format: https://twiki.cern.ch/twiki/bin/viewauth/cms/cmsfarmmonitoringcommonxmlformat

Questions?

Backup slides

Module information

Example modules: dCache Data Transfers module

Watches file transfers from dCache pools:
- Warns if transfers take a long time
- Warns if transfers are slow
- Serves as an indicator for inefficient jobs
- Reveals correlations with problematic pools or host systems

Example modules: dCache Data Management module

Compares the files known by dCache with the CMS DBS database:
- For each file known by dCache, checks which dataset it belongs to
- Serves as a consistency cross-check
- Datasets fully on disk are interesting for end-user analyses

Information shown:
- Size of the dataset in number of files and GB
- Part of the dataset present on disk
- Total space occupied on disk, including replicas
- Orphaned files (available in dCache but not in any dataset)

Example modules: RSS module

- Displays RSS feeds as a non-rated HappyFace module
- Informs shifters about general news or known problems
- Useful for external monitoring sources and service portals that use RSS
- For example, shows open tickets of relevant services