The HappyFace Meta-Monitoring Framework
B. Berge, M. Heinrich, G. Quast, A. Scheurer, M. Zvada
DPG Frühjahrstagung Karlsruhe, 28 March to 1 April 2011
KIT, University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu
HappyFace (HF) Basics
30.03.2011, Dr. Armin Scheurer
HF, what it is: a meta-monitoring suite
- Allows real-time site monitoring
- Acquires information automatically, not on demand
- Can be used as a sophisticated and modular shift tool
- Also provides detailed information for admins (if required)
- Auto-refresh system, no user intervention necessary
- Provides a rating system (keyword: non-expert shift crews)
- Allows information to be correlated
- Can trigger automatic alarms/notifications
- Highly configurable and adjustable, e.g. certificate-based access control
Technical Background
HF is:
- Written in Python
- Highly modular
  - Similar modules inherit functionality
  - Configuration is possible on each inheritance level, e.g. five modules that provide CMS Dashboard information for T1_DE_KIT can be switched to monitor the site T2_DE_DESY by altering a single line in the parent config file
- DB assisted
  - Intrinsic history functionality
- Lightweight, fast and reliable
  - In production use for more than 2 years
- Easy deployment on new sites
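The one-line site switch can be illustrated with a minimal sketch. It assumes an INI-style parent file that is read first and a per-module file that is layered on top; the section name `setup`, the keys `site`, `timeout` and `metric`, and the function `load_module_config` are hypothetical and not taken from HF itself.

```python
from configparser import ConfigParser

# Hypothetical parent config shared by all CMS Dashboard modules.
PARENT_CFG = """
[setup]
site = T1_DE_KIT
timeout = 30
"""

# Hypothetical config of one child module; it only adds what is specific to it.
MODULE_CFG = """
[setup]
metric = job_summary
"""

def load_module_config(parent_text, module_text):
    """Read the parent config first, then let the module config override or extend it."""
    parser = ConfigParser()
    parser.read_string(parent_text)   # inherited values
    parser.read_string(module_text)   # module-level values win on conflict
    return dict(parser["setup"])
```

Changing `site` in the parent text alone is then enough to point every module that inherits from it at another site, while module-level keys stay untouched.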
The Interface & Rating System
1. History Navigation
2. Category Navigation
3. Module Navigation
4. Module Content
Simple module and category rating system
[Figure: HF web interface with the four areas labelled 1 to 4]
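How a category rating could be derived from module ratings is sketched below. The numeric scale (1.0 good, 0.0 bad, negative for "no information"), the cut values, and the two algorithms `worst` and `average` are assumptions made for illustration; the slides only state that a simple module and category rating exists.

```python
def rate_category(module_statuses, algorithm="worst"):
    """Combine module statuses into one category status (hypothetical scheme)."""
    # Assumption: a negative status means the module delivered no information.
    valid = [s for s in module_statuses if s >= 0.0]
    if not valid:
        return -1.0
    if algorithm == "worst":
        return min(valid)
    if algorithm == "average":
        return sum(valid) / len(valid)
    raise ValueError("unknown rating algorithm: %s" % algorithm)

def face(status):
    """Map a numeric status to the symbol shown to the shifter (assumed cut values)."""
    if status < 0.0:
        return "unknown"
    if status > 0.66:
        return "happy"
    if status > 0.33:
        return "neutral"
    return "unhappy"
```

With `worst`, one failing module turns the whole category unhappy, which suits non-expert shift crews; `average` is more tolerant of a single outlier.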
HF CMS and ATLAS partners
HF core and module development:
- KIT Karlsruhe
Module development/usage:
- University of Hamburg
- DESY Hamburg
- RWTH Aachen
- University of Göttingen
Selected Module: Batch System Monitoring
- Real-time batch system monitoring
- Use the batch system XML provider (CMS has providers for e.g. PBS, LSF, Condor)
- See currently running batch jobs
- Calculate the current job efficiency
- Define warning/error thresholds
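The efficiency calculation and threshold check might look like the sketch below. It assumes job efficiency means CPU time divided by wall-clock time, and it invents a small XML layout (`<job>` elements with `state`, `cputime`, `walltime` attributes); the real provider format and the default threshold values are not given in the slides.

```python
import xml.etree.ElementTree as ET

# Made-up sample in a hypothetical provider format.
SAMPLE_XML = """
<jobs>
  <job id="1" state="running" cputime="900" walltime="1000"/>
  <job id="2" state="running" cputime="100" walltime="1000"/>
  <job id="3" state="queued" cputime="0" walltime="0"/>
</jobs>
"""

def running_job_efficiencies(xml_text):
    """Return {job id: cputime / walltime} for all running jobs."""
    root = ET.fromstring(xml_text.strip())
    efficiencies = {}
    for job in root.iter("job"):
        if job.get("state") != "running":
            continue
        walltime = float(job.get("walltime"))
        if walltime <= 0:
            continue  # job just started, efficiency not defined yet
        efficiencies[job.get("id")] = float(job.get("cputime")) / walltime
    return efficiencies

def rate_efficiency(mean_efficiency, warning=0.7, critical=0.3):
    """Compare the mean efficiency with the (assumed) warning/error thresholds."""
    if mean_efficiency < critical:
        return "critical"
    if mean_efficiency < warning:
        return "warning"
    return "ok"
```

A module would call `running_job_efficiencies` on each acquisition cycle, average the values, and pass the mean to `rate_efficiency`.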
Prototype: CMS Tier1 Batch System Monitoring
- Individual categories for each Tier1, e.g. KIT, FNAL, etc.
- Combined or individual categories for Tier2s
- Easily extendable to integrate e.g. storage system monitoring
- Everything in one view; cached, and thus very fast access to all pre-collected information
HF Access Control
- Certificate-based access control (e.g. Grid certificate)
- Access can be restricted for single modules or whole categories
- Hidden mode
- Use one single HF instance for admins and users
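A per-module restriction of this kind can be sketched as a lookup from module name to the certificate subjects (DNs) allowed to see it. The table layout, the example DN, and the reading of "hidden mode" as "restricted modules are simply not listed for other users" are assumptions, not HF's actual implementation.

```python
# Hypothetical access table: module name -> set of allowed certificate DNs,
# or None for a module that everyone may see.
ACCESS = {
    "batch_system": None,
    "user_space": {"/C=DE/O=GermanGrid/OU=KIT/CN=Example Admin"},
}

def visible_modules(access, user_dn=None):
    """Return the modules a user may see; restricted ones stay hidden otherwise."""
    visible = []
    for name, allowed_dns in access.items():
        if allowed_dns is None or user_dn in allowed_dns:
            visible.append(name)
    return visible
```

One HF instance can then serve both groups: an anonymous user gets only the public modules, while an admin presenting a listed certificate also sees the restricted ones.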
Extended Functionality
HF publishes its current status via XML:
- Used as input for a plugin for the Firefox web browser status bar
- Used as input for smartphone apps (e.g. iPhone, Android)
- Used for Meta²-Monitoring, ideal for central shifts: the HF matrix
- Clicking on an individual arrow jumps directly to the corresponding HF instance and module
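The XML export and its consumption by a client such as the Meta²-Monitoring matrix can be sketched as a round trip. The element and attribute names (`happyface`, `category`, `module`, `status`) are invented for this example; the actual HF schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET

def export_status(instance, categories):
    """Serialise {category: {module: status}} into a (hypothetical) status XML."""
    root = ET.Element("happyface", {"instance": instance})
    for category_name, modules in categories.items():
        category = ET.SubElement(root, "category", {"name": category_name})
        for module_name, status in modules.items():
            ET.SubElement(category, "module",
                          {"name": module_name, "status": str(status)})
    return ET.tostring(root, encoding="unicode")

def read_status(xml_text):
    """Client side: turn the XML back into {(category, module): status}."""
    root = ET.fromstring(xml_text)
    return {
        (category.get("name"), module.get("name")): float(module.get("status"))
        for category in root.iter("category")
        for module in category.iter("module")
    }
```

A browser plugin, smartphone app, or central matrix only needs the `read_status` half and the URL of each HF instance.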
Current Development: Database Backend
Standard HF DB backend: SQLite
- Pro: lightweight, file-based, well supported, perfectly suited for most HF sites
- Con: huge files, backup difficult, limited performance scalability
Solution: introduce support for arbitrary DB backends
- E.g. Postgres, MySQL, Oracle, etc.
- Each site can use its preferred DB backend to match its own setup (e.g. Oracle cluster at CERN)
- Allows site-specific, scalable performance optimisations
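One way to make the backend selectable per site is a single connection URL in the configuration, as sketched below. The URL convention and the `connect` helper are assumptions; only the SQLite branch is filled in here because it needs nothing beyond the Python standard library, and other backends would plug in their own driver at the marked place.

```python
import sqlite3
from urllib.parse import urlparse

def connect(db_url):
    """Open a DB connection chosen by the URL scheme (hypothetical convention)."""
    parts = urlparse(db_url)
    if parts.scheme == "sqlite":
        # "sqlite:///:memory:" -> ":memory:", "sqlite:///hf.db" -> "hf.db"
        path = parts.path[1:] or ":memory:"
        return sqlite3.connect(path)
    # A site using Postgres, MySQL or Oracle would add its driver here.
    raise NotImplementedError("no driver configured for '%s'" % parts.scheme)
```

For example, `connect("sqlite:///:memory:")` gives a throwaway database for testing, while a production site would change only the URL in its configuration.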
Summary
HF, what it is:
- Modular and easily configurable tool for shifters and admins
- All information is pre-collected (time interval: ~10 min)
  - No waiting time, live feeling
- Stores information from external sources (plots, XML/HTML, text files)
- Stores configuration parameters with the data to allow a consistent history view even after threshold changes, etc.
  - Identify problems: possibility to get the exact state of my site on Sunday night
- Provides a powerful rating system (different algorithms available)
- Can automatically trigger alarms/notifications
- Exports its status via XML for further usage/harvesting (e.g. iPhone app, Firefox plugin, Meta²-Monitoring)
- Used by German CMS and ATLAS sites for more than 2 years
  - Stable, reliable, tested
- Many modules available; designed for collaboration via a central repository
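Why storing the configuration with the data keeps the history consistent can be shown with a small sketch. The table layout, column names, and helper functions are hypothetical; the point is only that each row carries the thresholds that were valid when it was recorded, so a later threshold change does not rewrite the past.

```python
import sqlite3

def open_history(path=":memory:"):
    """Create a (hypothetical) history table that keeps thresholds next to each value."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS module_history ("
        "ts INTEGER, module TEXT, value REAL, warning REAL, critical REAL)"
    )
    return conn

def record(conn, ts, module, value, warning, critical):
    """Store one measurement together with the thresholds valid at that time."""
    conn.execute(
        "INSERT INTO module_history VALUES (?, ?, ?, ?, ?)",
        (ts, module, value, warning, critical),
    )
    conn.commit()

def status_at(conn, module, ts):
    """Rate the latest entry at or before ts using the thresholds stored with it."""
    row = conn.execute(
        "SELECT value, warning, critical FROM module_history "
        "WHERE module = ? AND ts <= ? ORDER BY ts DESC LIMIT 1",
        (module, ts),
    ).fetchone()
    if row is None:
        return None
    value, warning, critical = row
    if value < critical:
        return "critical"
    if value < warning:
        return "warning"
    return "ok"
```

Asking for the state "on Sunday night" then becomes a call to `status_at` with the corresponding timestamp, evaluated against Sunday's thresholds rather than today's.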
The HappyFace Project
More information & documentation: https://ekptrac.physik.uni-karlsruhe.de/trac/happyface
Backup Slides
Detailed Module Information
- History plot
- Detailed information about each module is available, including: current status, warning and critical thresholds, link to the information source, and instructions for shifters on what to do in case of problems
Some Selected Modules: Data Management
HF module for data management:
- See a list of all datasets available at a site
- Calculates the used space on disk
- Provides information about dataset distribution on the storage system (on disk, only on tape)
- Thresholds mark datasets green
- Allow/disallow user access to datasets which are not staged
Some Selected Modules: Storage System, Karlsruhe
HF provides modules to monitor dCache systems:
- Use the standard dCache XML provider
  - Can be used by any dCache site
- Monitor overall dCache status, down to the level of individual pools
- Monitor current I/O throughput and status of active/queued transfers
- Free space
- And many more features
Some Selected Modules: RSS Feeds
HF is a shifter tool, so it provides RSS feed functionality:
- Inform all shifters about ongoing issues
- Keep track of open tickets, etc.
- Very useful during shift changeover
List of Available Modules
Additional existing modules:
- Access to the dCache billing database, e.g. last file access, most/least used datasets, etc.
- SAM test: OPS and VO-specific results
- User-space monitoring for Tier2s (access control via certificate)
- Local computing hardware and software infrastructure surveillance, e.g. VO software area, VOBox, ILO interfaces, network connection tests, etc.
- HF provides an interface to Nagios to include sensors in the status calculation
- PhEDEx agent status, transfer quality, link status, etc.
- CMS Site Readiness status
- Consistency modules: local filespace vs. DBS and/or TMDB
- Collectors for binary information, e.g. plots from Dashboard, PhEDEx, local services (Ganglia), etc.
- ATLAS PanDA monitoring
- Shift features: module providing further monitoring, documentation and contact links