Site Meta-Monitoring: The HappyFace Project
G. Quast, A. Scheurer, M. Zvada
CMS Monitoring Review, 16.-17. November 2010
KIT, University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu
The HappyFace Project: Why Do We Need It, and What Are the Benefits?
16.11.2010, Dr. Armin Scheurer

HappyFace (HF) Basics
HF, what it is:
- A meta-monitoring suite
- Allows real-time site monitoring
- Acquires information automatically, not on demand
- Can be used as a sophisticated, modular shift tool and by admins
- Auto-update system, no user intervention necessary
- Provides a rating system (keyword: non-expert shift crews)
- Allows information to be correlated
- Can trigger automatic alarms/notifications
- Highly configurable and adjustable
HF, what it is not:
- A plain, stand-alone monitoring tool
- A competitor to the Dashboard

Background Information
HF is:
- Written in Python
- Highly modular
- Built on a DB backend
- Equipped with intrinsic history functionality
- Lightweight, fast and reliable
HF core and module development and maintenance:
- KIT Karlsruhe (HF core)
- DESY/University of Hamburg
- RWTH Aachen
- University of Göttingen

Live Demonstration
http://www-ekp.physik.uni-karlsruhe.de/~happyface/gridka/

The Interface & Rating System
1. History Navigation
2. Category Navigation
3. Module Navigation
4. Module Content
Simple module and category rating system

Module Information
[Screenshot: module view with history plot]
Detailed information about each module is available, including:
- Current status
- Warning and critical thresholds
- Link to the information source
- Instructions for shifters on what to do in case of problems
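
The combination of a current value with warning and critical thresholds, feeding a module and category rating, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Thresholds`, `rate` and `category_rating`, the three status strings, and the rule that a category takes the worst rating of its modules are not taken from the HF code base.

```python
from dataclasses import dataclass

# Hypothetical ordering of status labels, from best to worst.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class Thresholds:
    warning: float   # assumed: value at or above which the module is "warning"
    critical: float  # assumed: value at or above which the module is "critical"


def rate(value: float, thresholds: Thresholds) -> str:
    """Map a measured value to a status, assuming larger values are worse."""
    if value >= thresholds.critical:
        return "critical"
    if value >= thresholds.warning:
        return "warning"
    return "ok"


def category_rating(module_ratings: list[str]) -> str:
    """Assumed rule: a category is as bad as its worst module."""
    return max(module_ratings, key=SEVERITY.__getitem__, default="ok")
```

With this rule a single critical module is enough to turn its whole category critical, which is one plausible way to make problems visible to non-expert shift crews at a glance.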

Some Selected Modules: Batch System
- Real-time batch system monitoring
- Uses the batch system XML provider (Subir)
- Shows currently running batch jobs
- Calculates the current job efficiency
- Warning/error thresholds can be defined
- Allows logical AND and OR conditions
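
As a rough sketch of how a job efficiency and combined threshold conditions could be evaluated: the code below assumes that efficiency means total CPU time divided by total wall-clock time, which the slides do not define. The `Job` class, the `queued` argument and all threshold values are hypothetical and chosen only to show an AND and an OR condition.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cpu_time: float   # seconds of CPU time used so far
    wall_time: float  # seconds of wall-clock time since the job started


def efficiency(jobs):
    """Assumed definition: summed CPU time over summed wall time."""
    wall = sum(j.wall_time for j in jobs)
    if wall == 0:
        return None  # no running jobs, efficiency is undefined
    return sum(j.cpu_time for j in jobs) / wall


def batch_status(jobs, queued, min_jobs=10, warn_eff=0.7, crit_eff=0.3,
                 max_queued=1000):
    """Combine conditions with AND/OR (all limits are illustrative)."""
    eff = efficiency(jobs)
    enough = len(jobs) >= min_jobs  # ignore efficiency for tiny samples
    low_crit = eff is not None and eff < crit_eff
    low_warn = eff is not None and eff < warn_eff
    # critical: (enough jobs AND very low efficiency) OR a huge queue
    if (enough and low_crit) or queued > max_queued:
        return "critical"
    # warning: enough jobs AND moderately low efficiency
    if enough and low_warn:
        return "warning"
    return "ok"
```

Requiring a minimum number of jobs before the efficiency counts avoids raising an alarm because of one or two stalled jobs on an otherwise idle farm.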

Some Selected Modules: Storage System
- HF provides modules to monitor dCache systems
- Uses the standard dCache XML provider
- Can be used by any dCache site
- Monitors the overall dCache status, down to the level of individual pools
- Monitors current I/O throughput and the status of active/queued transfers
- Free space
- And many more features
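
A minimal sketch of pool-level monitoring from an XML source might look as follows. The XML layout in `SAMPLE` (element and attribute names, units) is invented for illustration and is not the format of the actual dCache XML provider; `summarize` and `min_free_fraction` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the real provider's schema will differ.
SAMPLE = """<pools>
  <pool name="pool_a" total="100" free="40" status="online"/>
  <pool name="pool_b" total="200" free="10" status="online"/>
  <pool name="pool_c" total="100" free="0" status="offline"/>
</pools>"""


def summarize(xml_text, min_free_fraction=0.1):
    """Sum up space over all pools and list pools that need attention."""
    root = ET.fromstring(xml_text)
    total = free = 0.0
    problems = []
    for pool in root.iter("pool"):
        pool_total = float(pool.get("total"))
        pool_free = float(pool.get("free"))
        total += pool_total
        free += pool_free
        if pool.get("status") != "online":
            problems.append((pool.get("name"), "offline"))
        elif pool_total > 0 and pool_free / pool_total < min_free_fraction:
            problems.append((pool.get("name"), "low space"))
    return {"total": total, "free": free, "problems": problems}
```

Calling `summarize(SAMPLE)` gives the overall totals plus one entry per problematic pool, mirroring the idea of an overall status that can be drilled down to individual pools.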

Some Selected Modules: Data Management
- HF module for data management
- Shows a list of all datasets available at a site
- Calculates the used space on disk
- Provides information about dataset distribution on the storage system (on disk, only on tape)
- Thresholds determine which datasets are marked green
- Allows/disallows user access to datasets which are not staged

Some Selected Modules: RSS Feeds
- HF is a shifter tool, so it provides RSS feed functionality
- Informs all shifters about ongoing issues
- Keeps track of open tickets, etc.
- Very useful during shift changeover

Extended Functionality
HF publishes its current status via XML:
- Used as input for a plugin for the Firefox web browser status bar
- Used as input for smartphone apps (e.g. iPhone, Android)
- Used for Meta²-Monitoring, ideal for central shifts: the HF matrix
- Clicking on an individual arrow jumps directly to the corresponding HF instance and module
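
To illustrate the idea of publishing a status document that other tools (a browser plugin, a phone app, a meta-monitoring matrix) can harvest, here is a small round-trip sketch. The element names `happyface` and `module` and their attributes are assumptions for this example only, not the schema HF actually exports.

```python
import xml.etree.ElementTree as ET


def status_to_xml(site, modules):
    """Serialize {module_name: (category, status)} into a status document."""
    root = ET.Element("happyface", {"site": site})
    for name, (category, status) in sorted(modules.items()):
        ET.SubElement(
            root,
            "module",
            {"name": name, "category": category, "status": status},
        )
    return ET.tostring(root, encoding="unicode")


def read_status(xml_text):
    """Consumer side: recover a {module_name: status} mapping."""
    root = ET.fromstring(xml_text)
    return {m.get("name"): m.get("status") for m in root.iter("module")}
```

A central matrix view could call something like `read_status` on the document of each site and show one cell per site and module, without querying the underlying monitoring sources again.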

Prototype: CMS Tier1 Batch System Monitoring
- Individual categories for each Tier1, e.g. KIT, FNAL, etc.
- Combined or individual categories for Tier2s
- Easily extendable, e.g. to integrate storage system monitoring
- Everything in one view; cached, and thus very fast access to all pre-collected information

List of Available Modules
Additional existing modules:
- Access to the dCache billing database, e.g. last file access, most/least used datasets, etc.
- SAM test: OPS and VO-specific results
- User-space monitoring for Tier2s (access control via certificate)
- Local computing hardware and software infrastructure surveillance: e.g. VO software area, VOBox, ILO interfaces, network connection tests, etc.
- HF provides an interface to Nagios to include sensors in the status calculation
- PhEDEx agent status, transfer quality, link status, etc.
- CMS Site Readiness status
- Consistency modules: local filespace vs. DBS and/or TMDB
- Collectors for binary information: e.g. plots from Dashboard, PhEDEx, local services (Ganglia), etc.
- ATLAS Panda monitoring
- Shift features: module providing further monitoring, documentation and contact links

Summary
HF, what it is:
- Modular and easily configurable tool for shifters and admins
- All information is pre-collected (time interval: ~10 min)
  - No waiting time, live feeling
- Stores information from external sources (plots, xml/html, text files)
- Stores configuration parameters with the data to allow a consistent history view even after threshold changes, etc.
  - Identify problems: possibility to get the exact state of my site on Sunday night
- Provides a rating system, allows complex correlations to be defined
- Hides unnecessary information but provides access to all details if required
- Possibility to automatically trigger alarms/notifications
- Exports its status via XML for further usage/harvesting (e.g. iPhone app, Firefox plugin, Meta²-Monitoring)
- Used by German CMS and ATLAS sites for more than 2 years now
  - Stable, reliable, tested
- Many modules available; designed for collaboration via a central repository
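
The point about storing configuration parameters together with the data can be sketched with a small SQLite example. The table layout, the column names and the helper functions below are hypothetical and do not reflect the actual HF database schema; the sketch only shows why keeping the thresholds in every row makes the history consistent after a threshold change.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create a hypothetical history table for one batch module."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS batch_history ("
        "timestamp INTEGER, efficiency REAL, "
        "warning REAL, critical REAL, status TEXT)"
    )
    return db


def record(db, timestamp, efficiency, warning, critical):
    """Store the value together with the thresholds valid at that time."""
    if efficiency < critical:
        status = "critical"
    elif efficiency < warning:
        status = "warning"
    else:
        status = "ok"
    db.execute(
        "INSERT INTO batch_history VALUES (?, ?, ?, ?, ?)",
        (timestamp, efficiency, warning, critical, status),
    )
    db.commit()


def state_at(db, timestamp):
    """Return the most recent row recorded at or before the given time."""
    return db.execute(
        "SELECT timestamp, efficiency, warning, critical, status "
        "FROM batch_history WHERE timestamp <= ? "
        "ORDER BY timestamp DESC LIMIT 1",
        (timestamp,),
    ).fetchone()
```

Because each row carries its own thresholds, looking up an earlier time with `state_at` shows the status as it was rated back then, even if the thresholds have been relaxed or tightened since.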

The HappyFace Project
More Information & Documentation:
https://ekptrac.physik.uni-karlsruhe.de/trac/happyface