HappyFace for CMS Tier-1 local job monitoring G. Quast, A. Scheurer, M. Zvada CMS Offline & Computing Week CERN, April 4 8, 2011 INSTITUT FÜR EXPERIMENTELLE KERNPHYSIK, KIT 1 KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association Name of Institute, Faculty, Department www.kit.edu
CMS T1 local job monitoring Initiative: CMS encourages sites to provide local monitoring information for more reliable job tracking All the T1s do have their own monitoring tools content tool specific some tools do create XML, but format varies A uniform view of Tier-1 local job monitoring missing (applies equally to Tier-2s) that's exactly what this project covers Idea is simple define common format for local job monitoring (XML) More details about plans for Tier-1 monitoring and common XML (effort started by Andrea Sciaba) https://twiki.cern.ch/bin/viewauth/cms/cmstier1monitoringproject 2
Towards a uniform XML view Define common XML format https://twiki.cern.ch/twiki/bin/viewauth/cms/cmsfarmmonitoringcommonxmlformat Define DB backend and presentation layer Enters HappyFace project https://ekptrac.physik.uni-karlsruhe.de/trac/happyface/ Common idea 3 T1s publish monitoring information in the common format (in addition to the original monitoring) Develop XML producer for PBS, condor, LSF (thanks Subir!, also Andrew Lahiff of RAL) provide producer to T1s in order to publish info Use HappyFace module read the XML, then visualize the information and store it into its database (HF developers @KIT) Decoupling of data generation and visualization Information for all the T1s integrated in a single page Warn if there is a significant number of inefficient jobs in a group
The HappyFace Project Allows quasi real-time site monitoring Acquires information automatically, not on demand! Can be used as sophisticated and modular shift tool Provides as well detailed information for admins (if required) Auto-refresh system, no user intervention necessary Provides rating system (Keyword: non-expert shift crews) Allows to correlate information Can trigger automatic alarms/notifications Highly configurable and adjustable e.g. certificate based access control Meta-monitoring Suite It is not: 4 A plain, stand-alone monitoring tool A competitor to Dashboard
Technical Background HF is: Written in Python Highly modular DB backend Intrinsic history functionality Lighweight, fast and reliable In production use since more than 2 years now HF core module development and maintenance: KIT (HF core) DESY/Uni of Hamburg RWTH Aachen University of Göttingen Easy Deployment on New Sites! 5
The HappyFace Web Front-End 6
Common XML format for T1 Jobs monitoring <?xml version= 1.0 encoding= UTF-8?> <jobinfo> Opening of root element <header> <batch>pbs</batch> <date>1284463501</date> <site>t1_de_kit</site> </header> header provides Metadata information: [...] Main content, see next slides <additional> [...] </additional> Any additional data that might be interesting for a given batch system. This is not processed any further by tools such as HF. </jobinfo> 7 XML Declaration Batch system used, site at which the jobs run and timestamp of the generation of the XML Document end
Job summaries <summaries> <summary group= cms parent= all > <jobs>2345</jobs> <running>1324</running> <pending>590</pending> <held>431</held> <ratio10>73</ratio10> <cpueff>81.92</cpueff> <cputime>18628651</cputime> <walltime> 22741071 </walltime> </summary> summaries contains summary elements each of which specifies jobs summary of a certain group. The parent attribute allows for a group hierarchy; all jobs in a given group also belong to its parent group. The special group all contains all jobs. <!-- more summaries of other groups --> </summaries> 8 jobs: Total number of jobs in group running: Number of running jobs pending: Number of jobs queued held: Number of jobs held ratio10: Number of jobs with cputime over walltime below 10% cpueff: cputime over walltime cputime: Total CPU time of all jobs in the group walltime: Total walltime of all jobs in the group
Individual jobs information <jobs> <job group= cmsprod > <id>osg-gw-[...]</id> <state>running</state> <user>cmsprod</user> <group>cmsprod</group> <created>1284428116</created> <start>1284445509</start> <end>n/a</end> <queue>cmsprod</queue> <cputime>17864</cputime> <walltime>19190</walltime> <cpueff>93.09</cpueff> <cpupercent>94.31</cpupercent> <exec_host> cabinet-1-1-3.t2.ucsd.edu </exec_host> <ce>osg-gw-4.t2.ucsd.edu</ce> </job> <!-- more jobs --> </jobs> 9 jobs contains job elements each of which specifies information about one particular job: id: ID of the job as given by batch sys state: One of running, pending, held, exited, waiting, suspended user: User running the job group: Group of the job created: Timestamp of submission start: Timestamp of job starting date end: Timestamp of job end date (currently always n/a) queue: Queue of the job cputime: Total CPU time used walltime: Total walltime used cpueff: cputime over walltime cpupercent: Current CPU usage exec_host: Name of worker node ce: Submitting compute element
HappyFace module for jobs statistics (I) CMS to use HappyFace for production job monitoring of CMS jobs at all T1s http://www-ekp.physik.uni-karlsruhe.de/~happyface/cmst1prodmon/webpage/ 10 Each T1 sets up an XML producer for their respective batch system HappyFace aggregates the information and visualizes it all at one place One category for each site T2 jobs can also be shown in another category
HappyFace module for jobs statistics (II) Group hierarchy Module Rating based on inefficient jobs in cms group 11
History Plots The history of each module's status and, depending on the module, other numerical variables can be plotted. Here: Number of Jobs at GridKa 12 All jobs Running jobs Jobs with walltime ratio below 10% History navigation interval feature: with a static link, good to place on TWiki sites, etc.
More features HF access control 13 Certificate-based access control Access can be restricted for single modules or whole categories Hidden mode Use one single HF instance for admins and users!
More features extended functionality HF publishes its current status via XML 14 Used as input for a plugin available for the Firefox web browser statusbar Used as input for smartphone apps (e.g. iphone, Android) Used for Meta² - Monitoring, ideal for central shifts the HF matrix
Current HF development: Database Backend 15
Roadmap and summary 16 Deploy service and maintain at CERN by FacOps Support HF core code by KIT
Thank you! HappyFace project page https://ekptrac.physik.uni-karlsruhe.de/trac/happyface/ HappyFace view of uniform CMS local job monitoring http://www-ekp.physik.uni-karlsruhe.de/~happyface/cmst1prodmon/webpage/ Documentation for common XML format https://twiki.cern.ch/twiki/bin/viewauth/cms/cmsfarmmonitoringcommonxmlformat Questions? 17
Backup slides 18
Module information 19
Example modules: dcache Data Transfers module Watch file transfers from dcache pools 20 Warns if transfers take much time Warns if transfers are slow Indicator for inefficient jobs Correlation with problematic pools or host systems is revealed
Example modules: dcache Data Management module Compare files known by dcache with the CMS DBS database For each file known by dcache check which dataset it belongs to Serves as a consistency cross check Datasets fully on disk are interesting to enduser analyses Size of dataset in #files and GB Part of dataset present on disk Total space occupied on disk, including replicas Orphaned files (available in dcache but not in any dataset) 21
Example modules: RSS module Allows to display RSS feeds as a non-rated HappyFace module Inform shifters about general news or known problems Useful for external monitoring sources and service portals which use RSS 22 For example, show open tickets of relevant services