ARDA Experiment Dashboard




Ricardo Rocha (ARDA, CERN), on behalf of the Dashboard Team
www.eu-egee.org
EGEE, INFSO-RI-508833

Outline
- Background
- Dashboard Framework
- VO Monitoring Applications
  - Job Monitoring
  - Site Monitoring / Efficiency
  - Data Management Monitoring
- Additional Applications
- Conclusion and Future Work

Background
- Started in 2005 inside the EGEE/ARDA group
- First application: Grid Job / Application Monitoring for the CMS experiment
  - implementation in PHP / Python
- Redesign in early 2006
  - fully Python-based solution
  - more modular / structured approach
  - easily extensible
- Additional application areas: data management (ATLAS DDM), site efficiency monitoring, ...

Dashboard Framework
Main components: Web / HTTP Interface, Agents, Data Access Layer (DAO)
- Dashboard Clients
  - Scripts: pycurl, ...
  - Command line tools (optparse + pycurl)
  - Shell based: curl, ...
- Web Application
  - Apache + mod_python
  - Model View Controller (MVC) pattern
  - multiple output formats: plain text, CSV, XML, XHTML
  - GSI support using gridsite
- Agents
  - collectors: RGMA, ICXML, BDII, ...
  - stats generation, alert managers, ...
  - Service Configurator pattern
  - common configuration (XML file) and management: stop, start, status, list
  - common monitoring mechanism
- Data Access Layer (DAO)
  - interfaces available to different backends (Oracle and PostgreSQL mainly, easy to add additional ones)
  - connection pooling
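The MVC point above (one set of data, several output formats) can be illustrated with a minimal sketch. It is not the Dashboard's actual code: the row fields, the `render` function, and the format names are assumptions made for this example, and only three of the four formats are shown.

```python
import csv
import io
from xml.etree import ElementTree as ET

# Hypothetical model data: job summary rows (not the Dashboard's real schema).
ROWS = [
    {"site": "SITE-A", "running": 12, "failed": 1},
    {"site": "SITE-B", "running": 7, "failed": 3},
]
FIELDS = ["site", "running", "failed"]


def render_text(rows):
    """Plain text view: one space-separated line per row."""
    return "\n".join(" ".join(str(row[f]) for f in FIELDS) for row in rows)


def render_csv(rows):
    """CSV view with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def render_xml(rows):
    """XML view: one <row> element per row, fields as child elements."""
    root = ET.Element("rows")
    for row in rows:
        item = ET.SubElement(root, "row")
        for name in FIELDS:
            ET.SubElement(item, name).text = str(row[name])
    return ET.tostring(root, encoding="unicode")


# The "controller" step: pick a view from the requested format name.
VIEWS = {"text": render_text, "csv": render_csv, "xml": render_xml}


def render(rows, fmt="text"):
    if fmt not in VIEWS:
        raise ValueError(f"unsupported output format: {fmt}")
    return VIEWS[fmt](rows)
```

Keeping the views in a lookup table means a new format (XHTML, for instance) only needs one more function and one more table entry; the model data is untouched.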

Dashboard Framework
- Build and development environment
  - based on Python distutils (with several extensions)
  - covers code validation, binaries and documentation generation, unit testing and reports
  - automatic build for each of the release branches
  - packaging uses RPMs
  - APT repository available
- Release procedure
  - three main branches: nightly, unstable, stable
  - releases per component
  - enforced versioning scheme (no manual versioning or tagging, all done via distutils command extensions)
- Interesting links
  - Developers guide: http://dashb-build.cern.ch/build/nightly/doc/guides/common/html/dev/index.html
  - Savannah Project: http://savannah.cern.ch/groups/dashboard

Job Monitoring
- Real time and summary views over the virtual organization (VO) grid jobs
- Several instances in production serving different communities: CMS, ATLAS, LHCb, ALICE, VLMED
- Various grid information sources used:
  - RGMA
  - GridPP XML files collection
  - LCG BDII
- Value added: VO specific information
  - through job instrumentation, using MonALISA's ApMon (CMS), Panda and Ganga monitoring (ATLAS)
  - directly querying VO databases (e.g. the ATLAS production database)
- Key features:
  - sensible merging of information from different sources
  - advanced filtering for different usages (VO manager, site admin, community user)
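One way to picture "merging of information from different sources" is a per-field merge with a source precedence. The sketch below is only an illustration under assumed rules: the slides do not describe the actual merging logic, and the source names, their ordering, and the record layout are hypothetical.

```python
# Hypothetical source names, ordered from least to most trusted.
# The real precedence rules are not given in the slides.
SOURCE_PRIORITY = ["bdii", "rgma", "job_instrumentation"]


def merge_job_records(records):
    """Merge per-source records into one view per job.

    `records` is an iterable of (source, job_id, fields) tuples. For each
    field, the value from the highest-priority source that reported it wins.
    """
    rank = {name: i for i, name in enumerate(SOURCE_PRIORITY)}
    merged = {}
    # Process low-priority sources first so that later ones overwrite them.
    for source, job_id, fields in sorted(records, key=lambda r: rank[r[0]]):
        job = merged.setdefault(job_id, {})
        for key, value in fields.items():
            if value is not None:  # a missing value never hides a known one
                job[key] = value
    return merged


records = [
    ("job_instrumentation", "job-1", {"status": "running", "application": "analysis"}),
    ("rgma", "job-1", {"status": "scheduled", "site": "SITE-A"}),
    ("bdii", "job-2", {"site": "SITE-B", "status": None}),
]
merged = merge_job_records(records)
```

Here the grid-level source contributes the site, while the (assumed more current) instrumentation data decides the status and adds the VO specific application field.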


Job Monitoring
- Task Monitoring
  - deployed and used in CMS
- Integration with SAM tests
  - already using the new LCG standards
  - prototype in place
- Alert mechanism
  - in development
- HTTP API for publishing job information
  - very easy to integrate with existing tools
  - similar to the mechanism used for data management
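To give a feel for what publishing over HTTP could look like from a client tool, here is a minimal sketch using only the Python standard library. The endpoint URL, the parameter names, and the form encoding are all assumptions for illustration; the slides do not specify the real API.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names, for illustration only.
PUBLISH_URL = "http://dashboard.example.org/jobs"


def build_publish_request(job_id, status, site):
    """Build (but do not send) an HTTP POST carrying one job status update."""
    body = urllib.parse.urlencode(
        {"job_id": job_id, "status": status, "site": site}
    ).encode("ascii")
    return urllib.request.Request(
        PUBLISH_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_publish_request("job-1", "running", "SITE-A")
# Sending it would be: urllib.request.urlopen(request, timeout=10)
```

Because the update is an ordinary HTTP request, the same call can be made from a shell script with curl, which is what makes this kind of API easy to bolt onto existing tools.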

Grid / Site Efficiency
- Built on top of the job monitoring data
- Main goal: identify reasons for job failures in sites
- Uses the information coming from RGMA
- Available today for the same set of communities: ATLAS, CMS, LHCb, ALICE, VLMED
- Provides both summary and detailed information
- Ongoing work: provide a generic (non VO specific) view over the data
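The summary view described above boils down to two numbers per site: the fraction of jobs that succeeded and the most frequent failure reasons. A minimal sketch, assuming a simplified job record of (site, succeeded, reason), which is not the Dashboard's real data model:

```python
from collections import Counter, defaultdict


def site_efficiency(jobs):
    """Return {site: (efficiency, most_common_failure_reasons)}.

    `jobs` is an iterable of (site, succeeded, reason) tuples, where `reason`
    describes the failure for failed jobs and is ignored otherwise.
    """
    totals = Counter()
    successes = Counter()
    failures = defaultdict(Counter)
    for site, succeeded, reason in jobs:
        totals[site] += 1
        if succeeded:
            successes[site] += 1
        else:
            failures[site][reason] += 1
    return {
        site: (successes[site] / totals[site], failures[site].most_common(3))
        for site in totals
    }


jobs = [
    ("SITE-A", True, None),
    ("SITE-A", False, "proxy expired"),
    ("SITE-A", False, "proxy expired"),
    ("SITE-A", False, "stage-in failed"),
    ("SITE-B", True, None),
]
report = site_efficiency(jobs)
```

The efficiency figure gives the summary, and the ranked failure reasons are the starting point for the detailed view.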


Data Management
- Tied to the ATLAS Distributed Data Management (DDM) system
- Used successfully both in the production and Tier0 test environments
- Data sources:
  - DDM site services: the main source, providing all the transfer and placement information
  - SAM tests: for correlation of DDM results with the state of the grid fabric services
  - Storage space availability: from BDII but soon including other available tools
- Views over the data:
  - Global: site overview covering different metrics (throughput, files / datasets completed, ...); summary of the most common errors (transfer and placement)
  - Detailed: starting from the dataset state, to the state of each of its files, to the history of each single file placement (all state changes)
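The detailed view is a three-level drill-down: dataset, then files, then the state history of each file placement. The sketch below shows one possible data structure for it. The class names, the state names (`QUEUED`, `DONE`, ...), and the rule that a dataset is complete once all its files are done are assumptions for this example, not DDM's actual model.

```python
from dataclasses import dataclass, field


@dataclass
class FilePlacement:
    name: str
    history: list = field(default_factory=list)  # (timestamp, state) pairs

    def record(self, timestamp, state):
        """Append one state change; the full history is kept."""
        self.history.append((timestamp, state))

    @property
    def state(self):
        """Current state is the most recent entry in the history."""
        return self.history[-1][1] if self.history else "UNKNOWN"


@dataclass
class Dataset:
    name: str
    files: list = field(default_factory=list)

    @property
    def state(self):
        """COMPLETE only when every file is DONE (an assumed rule)."""
        if self.files and all(f.state == "DONE" for f in self.files):
            return "COMPLETE"
        return "INCOMPLETE"


first = FilePlacement("file-1")
first.record(1, "QUEUED")
first.record(2, "DONE")
second = FilePlacement("file-2")
dataset = Dataset("dataset-1", [first, second])
```

Deriving the dataset state from the file states, and the file state from its history, keeps the three levels of the view consistent with each other by construction.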


Data Management
- Other features
  - periodic site behavior reports (sent by email)
  - alerts (on specific errors, when a site goes below a certain threshold, ...)
- Coming soon
  - user specific views (authentication via X509 certificates)
    - my datasets
  - better site summary data: overview of dataset / file states in the site (radar plots), average time in each placement step, additional error summaries
  - Python query API module
  - Python publish API module (open the tool to other applications / communities)

Conclusion
- The Dashboard monitors the grid from the point of view of its communities and focuses on the different users' interests (managers, admins, end users)
  - Grid information is not enough (additional VO information is invaluable)
- Framework
  - Flexible and stable: proven by the variety of applications available in production
  - Effort put into install / packaging paid off: first external installation has already been done (VLMED)
- Future work
  - integration with local monitoring systems (feed summaries back to the site admins)
  - improved alert system
  - adapt to recently defined data exchange / query standards

http://dashboard.cern.ch