Monitoring Evolution WLCG collaboration workshop 7 July Pablo Saiz IT/SDC
|
|
|
- Mariah Jennings
- 10 years ago
- Views:
Transcription
1 Monitoring Evolution WLCG collaboration workshop 7 July 2014 Pablo Saiz IT/SDC
2 Monitoring evolution Past Present Future 2
3 The past Working monitoring solutions Small overlap in functionality Big diversity in tools/operations/development Large amount of people to support it Some parts based on old technologies Unsustainable, due to personnel reduction 3
4 WLCG Monitoring consolidation group Create a team: Mon + Exp. Rep + Ops. Rep + AI Mon Main goals: Reduce complexity, modular design Simplify operations, support and service Common development and deployment Unify, where possible, components We know we can monitor with the current system: How can we do it with less resources? 3 months evaluation + 3 months design + 9 months implementation 4
5 Design report and plan By the end of 2013: Report with summary of current status, and tasks to improve it Tasks: Deprecate unused applications Combine applications/tools Evaluate technologies Improve deployment/operations 5
6 We have a plan! 6
7 Present Working monitoring solution(s) In the process of unification Huge improvement on deployment/operations Using CERN Agile infrastructure WLCG SAM decoupled from EGI Using a common metric store for multiple applications Evaluating different technologies for storage/visualization 7
8 Agile Infrastructure Over 100 hosts puppetized (into almost 100 hostgroups!) Web portals: SAM3, SSB, Job Monitoring, DDM, REBUS, Hammercloud Services: Dashboard collectors, HC submission nodes Other: Development nodes, build servers (deprecated for koji) Using CERN Koji, git, jira, SLC6 8
9 Decouple of EGI/WLCG SAM 9
10 Towards a common metric store SAM Nagios tests Hammer Cloud jobs Metrics defined by experiments Pledges T R A N S P O R T SSB-like metric store Profile Topology Processing engine UI Downtime 10
11 Common format metric store Using the Site Status Board as a metric store Stores values for instances over a period of time Instances can be sites, services, channels, clouds Provides possibility to combine metrics And also extract them and publish new ones Used by Experiments SSB (CMS, ATLAS, LHCb) SAM (test status, availabilities) Downtimes REBUS (pledges, capacities) Glue validator Cloud accounting 11
12 Creating Derived Metrics Based on some metrics, create new ones: Basic operations provided by SSB AND, OR, OVERWRITE Possibility to fetch data and provide complex calculation Derived metrics can be used as input for even more metrics 12
13 Site usability Combine results of tests with real utilization of the site Including metrics published by experiment Similar to CMS Site readiness and ATLAS site blacklisting ATLAS interested in using this approach for monthly reports To be followed up 13
14 SAM3 interfaces Very similar interface to SUM With FQAN and any service Using SSB as a common metric store Using the tests results from production Including all current profiles And even more! Profiles could be based on any metric Availability and reliability calculated In the process of validation
15 Slide from Jordi Casals (PIC) N8r7 15
16 And even more displays Hola Slides from David Crooks (Glasgow) 16
17 Wigner efficiency analysis Effort involving multiple groups MeyrinVSWigner Correlate job (cpu/wall), events/second, grid id, local batch id, SLCX, virtual/physical, EOS access, network status ATLAS will do functional tests in Geneva and Wigner, passing job Id and Panda id 17
18 CVMFS Monitoring Slides from Costin Grigoras (CERN) See ALICE talk tomorrow for more details 18
19 The future One monitoring solution Easy to maintain/evolve Modular design Plugin/replace components Possibility to correlate data Providing all required information to end user Including Real Time Analytics 19
20 Data Analytics Slide from Uthay Suthakar (Brunel) Luca Magnoni (CERN) Prototype with xrootd monitoring data 20
21 Correlations Are low-efficiency jobs related to storage overload, remote access, hardware type, site? Does the ATLAS transfer rate depend on the number of jobs of another experiment? What is the percentage of pledges used for MC at a particular site? Do the jobs of the same user have different efficiency depending on date/site/concurrent jobs? 21
22 Technology evaluation Keeping an eye on technology evolution Storage: ElasticSearch, hdfs, Impala, Riak Visualization: Ember, Angular Real time analytics: Storm, ESPER 22
23 Summary Consolidating WLCG monitoring Working towards a unified monitoring solution Plenty of work already done: Service deployment/operations on AI Using common tools: Koji, git, jira Using common metric store With different UI tailored to the needs And more to come Data analytics Correlations Technology evolution 23
ANALYSIS FUNCTIONAL AND STRESS TESTING
ANALYSIS FUNCTIONAL AND STRESS TESTING Dan van der Ster, CERN IT-ES-DAS for the HC team: Johannes Elmsheuser, Federica Legger, Mario Úbeda García WLCG Workshop, 8 July 2010 Outline Overview what should
ARDA Experiment Dashboard
ARDA Experiment Dashboard Ricardo Rocha (ARDA CERN) on behalf of the Dashboard Team www.eu-egee.org egee INFSO-RI-508833 Outline Background Dashboard Framework VO Monitoring Applications Job Monitoring
ATLAS job monitoring in the Dashboard Framework
ATLAS job monitoring in the Dashboard Framework J Andreeva 1, S Campana 1, E Karavakis 1, L Kokoszkiewicz 1, P Saiz 1, L Sargsyan 2, J Schovancova 3, D Tuckett 1 on behalf of the ATLAS Collaboration 1
The dashboard Grid monitoring framework
The dashboard Grid monitoring framework Benjamin Gaidioz on behalf of the ARDA dashboard team (CERN/EGEE) ISGC 2007 conference The dashboard Grid monitoring framework p. 1 introduction/outline goals of
Computing at the HL-LHC
Computing at the HL-LHC Predrag Buncic on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini,
Configuration Management Evolution at CERN. Gavin McCance [email protected] @gmccance
Configuration Management Evolution at CERN Gavin McCance [email protected] @gmccance Agile Infrastructure Why we changed the stack Current status Technology challenges People challenges Community The
Batch and Cloud overview. Andrew McNab University of Manchester GridPP and LHCb
Batch and Cloud overview Andrew McNab University of Manchester GridPP and LHCb Overview Assumptions Batch systems The Grid Pilot Frameworks DIRAC Virtual Machines Vac Vcycle Tier-2 Evolution Containers
CERN Cloud Storage Evaluation Geoffray Adde, Dirk Duellmann, Maitane Zotes CERN IT
SS Data & Storage CERN Cloud Storage Evaluation Geoffray Adde, Dirk Duellmann, Maitane Zotes CERN IT HEPiX Fall 2012 Workshop October 15-19, 2012 Institute of High Energy Physics, Beijing, China SS Outline
DSS. High performance storage pools for LHC. Data & Storage Services. Łukasz Janyst. on behalf of the CERN IT-DSS group
DSS High performance storage pools for LHC Łukasz Janyst on behalf of the CERN IT-DSS group CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Introduction The goal of EOS is to provide a
PoS(EGICF12-EMITC2)110
User-centric monitoring of the analysis and production activities within the ATLAS and CMS Virtual Organisations using the Experiment Dashboard system Julia Andreeva E-mail: [email protected] Mattia
The Evolution of Cloud Computing in ATLAS
The Evolution of Cloud Computing in ATLAS Ryan Taylor on behalf of the ATLAS collaboration CHEP 2015 Evolution of Cloud Computing in ATLAS 1 Outline Cloud Usage and IaaS Resource Management Software Services
Evolution of Database Replication Technologies for WLCG
Home Search Collections Journals About Contact us My IOPscience Evolution of Database Replication Technologies for WLCG This content has been downloaded from IOPscience. Please scroll down to see the full
Evaluation and implementation of CEP mechanisms to act upon infrastructure metrics monitored by Ganglia
Project report CERN Summer Student Programme Evaluation and implementation of CEP mechanisms to act upon infrastructure metrics monitored by Ganglia Author: Martin Adam Supervisors: Cristovao Cordeiro,
Das HappyFace Meta-Monitoring Framework
Das HappyFace Meta-Monitoring Framework B. Berge, M. Heinrich, G. Quast, A. Scheurer, M. Zvada, DPG Frühjahrstagung Karlsruhe, 28. März 1. April 2011 KIT University of the State of Baden-Wuerttemberg and
Report from SARA/NIKHEF T1 and associated T2s
Report from SARA/NIKHEF T1 and associated T2s Ron Trompert SARA About SARA and NIKHEF NIKHEF SARA High Energy Physics Institute High performance computing centre Manages the Surfnet 6 network for the Dutch
LHC schedule: what does it imply for SRM deployment? [email protected]. CERN, July 2007
WLCG Service Schedule LHC schedule: what does it imply for SRM deployment? [email protected] WLCG Storage Workshop CERN, July 2007 Agenda The machine The experiments The service LHC Schedule Mar. Apr.
CERN local High Availability solutions and experiences. Thorsten Kleinwort CERN IT/FIO WLCG Tier 2 workshop CERN 16.06.2006
CERN local High Availability solutions and experiences Thorsten Kleinwort CERN IT/FIO WLCG Tier 2 workshop CERN 16.06.2006 1 Introduction Different h/w used for GRID services Various techniques & First
TRAVERSE: VIRTUALIZATION AND PRIVATE CLOUD MONITORING
TRAVERSE: VIRTUALIZATION AND PRIVATE CLOUD MONITORING SUMMARY Given recent advances in distributed computing, virtualization and private cloud technologies, enterprise datacenters have effectively become
Status and Evolution of ATLAS Workload Management System PanDA
Status and Evolution of ATLAS Workload Management System PanDA Univ. of Texas at Arlington GRID 2012, Dubna Outline Overview PanDA design PanDA performance Recent Improvements Future Plans Why PanDA The
Database Services for Physics @ CERN
Database Services for Physics @ CERN Deployment and Monitoring Radovan Chytracek CERN IT Department Outline Database services for physics Status today How we do the services tomorrow? Performance tuning
DevOps Course Content
DevOps Course Content INTRODUCTION TO DEVOPS What is DevOps? History of DevOps Dev and Ops DevOps definitions DevOps and Software Development Life Cycle DevOps main objectives Infrastructure As A Code
Converged Infrastructure
Converged Infrastructure Convergence is Transforming the Data Centre 2 Dell s Technology Vision 3 28/10/2013 Rolling out new IT services takes too long Inefficient IT leads to difficulty supporting the
HTCondor at the RAL Tier-1
HTCondor at the RAL Tier-1 Andrew Lahiff, Alastair Dewhurst, John Kelly, Ian Collier, James Adams STFC Rutherford Appleton Laboratory HTCondor Week 2014 Outline Overview of HTCondor at RAL Monitoring Multi-core
OSG Operational Infrastructure
OSG Operational Infrastructure December 12, 2008 II Brazilian LHC Computing Workshop Rob Quick - Indiana University Open Science Grid Operations Coordinator Contents Introduction to the OSG Operations
Agile Infrastructure: an updated overview of IaaS at CERN
Agile Infrastructure: an updated overview of IaaS at CERN Luis FERNANDEZ ALVAREZ on behalf of Cloud Infrastructure Team [email protected] HEPiX Spring 2013 CERN IT Department CH-1211 Genève
Virtualisation Cloud Computing at the RAL Tier 1. Ian Collier STFC RAL Tier 1 HEPiX, Bologna, 18 th April 2013
Virtualisation Cloud Computing at the RAL Tier 1 Ian Collier STFC RAL Tier 1 HEPiX, Bologna, 18 th April 2013 Virtualisation @ RAL Context at RAL Hyper-V Services Platform Scientific Computing Department
Forschungszentrum Karlsruhe in der Helmholtz - Gemeinschaft. Holger Marten. Holger. Marten at iwr. fzk. de www.gridka.de
Tier-2 cloud Holger Marten Holger. Marten at iwr. fzk. de www.gridka.de 1 GridKa associated Tier-2 sites spread over 3 EGEE regions. (4 LHC Experiments, 5 (soon: 6) countries, >20 T2 sites) 2 region DECH
Storage strategy and cloud storage evaluations at CERN Dirk Duellmann, CERN IT
SS Data & Storage Storage strategy and cloud storage evaluations at CERN Dirk Duellmann, CERN IT (with slides from Andreas Peters and Jan Iven) 5th International Conference "Distributed Computing and Grid-technologies
Virtualizing Apache Hadoop. June, 2012
June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING
The Agile Infrastructure Project. Monitoring. Markus Schulz Pedro Andrade. CERN IT Department CH-1211 Genève 23 Switzerland www.cern.
The Agile Infrastructure Project Monitoring Markus Schulz Pedro Andrade CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Outline Monitoring WG and AI Today s Monitoring in IT Architecture
Minder. simplifying IT. All-in-one solution to monitor Network, Server, Application & Log Data
Minder simplifying IT All-in-one solution to monitor Network, Server, Application & Log Data Simplify the Complexity of Managing Your IT Environment... To help you ensure the availability and performance
PES. Ermis service for DNS Load Balancer configuration. HEPiX Fall 2014. Aris Angelogiannopoulos, CERN IT-PES/PS Ignacio Reguero, CERN IT-PES/PS
Ermis service for DNS Load Balancer configuration HEPiX Fall 2014 Aris Angelogiannopoulos, CERN IT-PES/PS Ignacio Reguero, CERN IT-PES/PS 1 Outline Core concepts DNS Load Balancing at CERN Motivation and
Running real jobs in virtual machines. Andrew McNab University of Manchester
Running real jobs in virtual machines Andrew McNab University of Manchester Overview The Grid we have now Strategies for starting VMs Fabric / Cloud / Vac Hypervisors Disk images Contextualization Pilot
The dcache Storage Element
16. Juni 2008 Hamburg The dcache Storage Element and it's role in the LHC era for the dcache team Topics for today Storage elements (SEs) in the grid Introduction to the dcache SE Usage of dcache in LCG
How To Use Big Data For Telco (For A Telco)
ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call
Private Cloud: Regain Control of IT
Platform Private Cloud: Regain Control of IT Computing Songnian Zhou TORONTO 10/21/2010 Copyright 2009 Platform Computing Corporation. All Rights Reserved. The Evolving Data Center (1970-2008) From managing
Big Science and Big Data Dirk Duellmann, CERN Apache Big Data Europe 28 Sep 2015, Budapest, Hungary
Big Science and Big Data Dirk Duellmann, CERN Apache Big Data Europe 28 Sep 2015, Budapest, Hungary 16/02/2015 Real-Time Analytics: Making better and faster business decisions 8 The ATLAS experiment
Marketing Orchestration. Better Metrics, Happier Customers. 72% Impression Lift 107% CTR Improvement
Marketing Orchestration Better Metrics, Happier Customers. If your customers are happy, your metrics will show it. Higher ROI, higher click-through rates, more impressions, more sales. Leverage your 1
Data Management Plan (DMP) for Particle Physics Experiments prepared for the 2015 Consolidated Grants Round. Detailed Version
Data Management Plan (DMP) for Particle Physics Experiments prepared for the 2015 Consolidated Grants Round. Detailed Version The Particle Physics Experiment Consolidated Grant proposals now being submitted
Tier0 plans and security and backup policy proposals
Tier0 plans and security and backup policy proposals, CERN IT-PSS CERN - IT Outline Service operational aspects Hardware set-up in 2007 Replication set-up Test plan Backup and security policies CERN Oracle
Building a logging pipeline with Open Source tools. Iñigo Ortiz de Urbina Cazenave
Building a logging pipeline with Open Source tools Iñigo Ortiz de Urbina Cazenave NLUUG Utrecht - Netherlands 28 May 2015 whoami; 2 Iñigo Ortiz de Urbina Cazenave Systems Engineer whoami; groups; 3 Iñigo
PES. Batch virtualization and Cloud computing. Part 1: Batch virtualization. Batch virtualization and Cloud computing
Batch virtualization and Cloud computing Batch virtualization and Cloud computing Part 1: Batch virtualization Tony Cass, Sebastien Goasguen, Belmiro Moreira, Ewan Roche, Ulrich Schwickerath, Romain Wartel
EMC Data Protection Advisor 6.0
White Paper EMC Data Protection Advisor 6.0 Abstract EMC Data Protection Advisor provides a comprehensive set of features to reduce the complexity of managing data protection environments, improve compliance
Managing Application Performance with JBoss Operations Network and OC Systems RTI
Managing Application Performance with JBoss Operations Network and OC Systems RTI Joe Fernandes - Sr. Product Marketing Manager, Red Hat Steve Sturtevant - Product Manager, OC Systems March 21, 2012 Agenda
Monitoring, Managing and Supporting Enterprise Clouds with Oracle Enterprise Manager 12c Name, Title Oracle
Monitoring, Managing and Supporting Enterprise Clouds with Oracle Enterprise Manager 12c Name, Title Oracle Complete Cloud Lifecycle Management Optimize Plan Meter & Charge Manage Applications and Business
CLOUD DEVELOPMENT BEST PRACTICES & SUPPORT APPLICATIONS
whitepaper CLOUD DEVELOPMENT BEST PRACTICES & SUPPORT APPLICATIONS - Cloud Development Best Practices and Support Applications CLOUD DEVELOPMENT BEST PRACTICES 1 Cloud-based solutions are increasingly
How To Use Happyface (Hf) On A Network (For Free)
Site Meta-Monitoring The HappyFace Project G. Quast, A. Scheurer, M. Zvada CMS Monitoring Review, 16. 17. November 2010 KIT University of the State of Baden-Wuerttemberg and National Research Center of
Building Out Your Cloud-Ready Solutions. Clark D. Richey, Jr., Principal Technologist, DoD
Building Out Your Cloud-Ready Solutions Clark D. Richey, Jr., Principal Technologist, DoD Slide 1 Agenda Define the problem Explore important aspects of Cloud deployments Wrap up and questions Slide 2
Information Technology Services. Roadmap 2014-2016
Information Technology Services Roadmap 2014-2016 Introduction This document charts the direction for Humboldt State University s Information Technology Services department over the next three years. It
Total Cloud Control with Oracle Enterprise Manager 12c. Kevin Patterson, Principal Sales Consultant, Enterprise Manager Oracle
Total Cloud Control with Oracle Enterprise Manager 12c Kevin Patterson, Principal Sales Consultant, Enterprise Manager Oracle 2 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Insert
API Architecture. for the Data Interoperability at OSU initiative
API Architecture for the Data Interoperability at OSU initiative Introduction Principles and Standards OSU s current approach to data interoperability consists of low level access and custom data models
Ansible in Depth WHITEPAPER. ansible.com +1 800-825-0212
+1 800-825-0212 WHITEPAPER Ansible in Depth Get started with ANSIBLE now: /get-started-with-ansible or contact us for more information: info@ INTRODUCTION Ansible is an open source IT configuration management,
FUJITSU Software ServerView Cloud Monitoring Manager V1 Introduction
FUJITSU Software ServerView Cloud Monitoring Manager V1 Introduction November 2015 Fujitsu Limited Product Overview 1 Why a Monitoring & Logging OpenStack Service? OpenStack systems are large, complex
The Remote Infrastructure Management Platform
services capabilities The Remote Infrastructure Management Platform What is the Remote Infrastructure Management (RIM) Platform? As part of our Global Services Operating Architecture, the RIM platform
CA NSM r11.2. Benefits. The CA Advantage. Overview
PRODUCT BRIEF: CA NSM CA NSM r11.2 CA NSM HELPS YOU INTEGRATE AND SIMPLIFY THE MANAGEMENT OF SYSTEMS AND EVENTS ACROSS COMPLEX MULTIVENDOR INFRASTRUCTURE, MANAGING PHYSICAL AND VIRTUAL SYSTEMS IN BOTH
Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations
Beyond Lambda - how to get from logical to physical Artur Borycki, Director International Technology & Innovations Simplification & Efficiency Teradata believe in the principles of self-service, automation
Datasheet FUJITSU Cloud Monitoring Service
Datasheet FUJITSU Cloud Monitoring Service FUJITSU Cloud Monitoring Service powered by CA Technologies offers a single, unified interface for tracking all the vital, dynamic resources your business relies
Easily Connect, Control, Manage, and Monitor All of Your Devices with Nivis Cloud NOC
Easily Connect, Control, Manage, and Monitor All of Your Devices with Nivis Cloud NOC As wireless standards develop and IPv6 gains widespread adoption, more and more developers are creating smart devices
QAD BUSINESS INTELLIGENCE
QAD BUSINESS INTELLIGENCE QAD BUSINESS INTELLIGENCE QAD Business Intelligence unifies data from multiple sources across the enterprise, providing a comprehensive solution that enables key enterprise decision
OpenITSM - IT Service Management with Open Source Software
OpenITSM - IT Service Management with Open Source Software 03.02.2011 CloudExpo London Speaker: Julian Hein NETWAYS Founded 1995 26 full time employees Headquarter Nuremberg, Germany Focus on Open Source
agility made possible
SOLUTION BRIEF Flexibility and Choices in Infrastructure Management can IT live up to business expectations with soaring infrastructure complexity and challenging resource constraints? agility made possible
