The Data Quality Monitoring Software for the CMS experiment at the LHC




Marco Rovere, CERN, on behalf of the CMS Collaboration
CHEP 2015, Evolution of Software and Computing for Experiments
Okinawa, Japan, April 13-17, 2015

DQM is Everywhere

Preview
- Online DQM
- Offline DQM
- Generic Tools & Beyond Run2: further developments and possible improvements

Online DQM Challenges and Solutions

Online DQM during Run I
It's all about the Storage Manager Proxy Server
- Network DAQ-to-DQM communication relied on the Storage Manager Proxy Server (SMPS), which always served the latest event available.
- Filtering capabilities were provided by the SMPS.
- HLT histogram merging was provided by the SMPS.
- The SMPS disappeared in the improved DAQ2 design.

Online DQM during Run II
Taking responsibilities: give back to DQM what is DQM's
- Completely file based and not event driven: DQM receives all events per lumisection (LS), which should be processed sequentially.
- No prior filtering applied.
- HLT histogram merging is in DQM's hands.
- Better separation of responsibilities.
- Required a deep redesign of the processing logic.

File-based approach & its challenges

Requirements
- Live event monitoring
- No external (XDAQ) process to start/stop DQM applications
- No prior event selection
- Fast histogram merging

Solutions
- Reduce latency by:
  - automatic file discovery
  - tunable min. events/lumi
  - skip to latest lumi/file
- Reuse of HLT technology to handle all run transitions
- Event selection from within the main framework
- Dedicated merging utility: fasthadd (ROOT + ProtocolBuffer)

Merging utility   10 files   30 files   50 files
ROOT's hadd       10.8 s     48.9 s     125 s
fasthadd          3.8 s      10.2 s     17 s
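
As a rough illustration of what histogram merging involves, the sketch below sums same-named histograms bin by bin across several input files, and then computes the speedup implied by the hadd and fasthadd timings quoted above. It is a toy model, not the fasthadd implementation: the list-of-bins histogram representation and the names `merge_files` and `speedups` are assumptions made for this example, and no ROOT or ProtocolBuffer code is involved.

```python
from typing import Dict, List

# Toy stand-in for a ROOT histogram: bin contents only (an assumption of this sketch).
Histogram = List[float]


def merge_files(files: List[Dict[str, Histogram]]) -> Dict[str, Histogram]:
    """Sum same-named histograms bin by bin across all input files."""
    merged: Dict[str, Histogram] = {}
    for histograms in files:
        for name, bins in histograms.items():
            if name not in merged:
                merged[name] = list(bins)  # copy so the inputs stay untouched
                continue
            if len(merged[name]) != len(bins):
                raise ValueError(f"binning mismatch for {name!r}")
            merged[name] = [a + b for a, b in zip(merged[name], bins)]
    return merged


# Timings quoted on the slide, in seconds, keyed by the number of merged files.
HADD_SECONDS = {10: 10.8, 30: 48.9, 50: 125.0}
FASTHADD_SECONDS = {10: 3.8, 30: 10.2, 50: 17.0}


def speedups() -> Dict[int, float]:
    """Ratio of hadd time to fasthadd time for each file count."""
    return {n: HADD_SECONDS[n] / FASTHADD_SECONDS[n] for n in HADD_SECONDS}


if __name__ == "__main__":
    for n_files, ratio in sorted(speedups().items()):
        print(f"{n_files} files: fasthadd is {ratio:.1f}x faster than hadd")
```

With the quoted timings, the ratio grows with the number of files, which suggests that the advantage of the dedicated utility increases as more files are merged.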

Offline DQM Challenges and Solutions

Multi-threading in the CMS reconstruction framework (CMSSW)
- What is it? The ability to process many events in different threads.
- Where is it? At the roots of CMSSW; it affects everything.
- Why does it affect DQM? DQM uses a shared memory pool for holding ROOT histograms.

Multithreading in DQM: Basic Concepts
- Serialize access to shared resources/data
- Enforce policy via an interface
- Localize resources/histograms to each thread
- Aggregate histograms at the end of data processing
- Hide complexity from end users
- Avoid a full rewrite of the existing DQM Framework
- Prevent users from making mistakes
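
The idea of localizing histograms to each thread and aggregating them at the end can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the CMSSW or DQM Framework implementation: the class `ThreadLocalHistogram` and the helper `fill_from_threads` are hypothetical names, and a histogram is reduced to a mapping from bin index to count.

```python
import threading
from collections import Counter
from typing import List


class ThreadLocalHistogram:
    """Hypothetical histogram with one private copy per thread, merged on demand."""

    def __init__(self) -> None:
        self._local = threading.local()   # holds each thread's private copy
        self._lock = threading.Lock()     # serializes access to the shared list only
        self._copies: List[Counter] = []  # all per-thread copies, for aggregation

    def fill(self, bin_index: int, weight: int = 1) -> None:
        counts = getattr(self._local, "counts", None)
        if counts is None:
            # First fill from this thread: create its copy and register it once.
            counts = Counter()
            self._local.counts = counts
            with self._lock:
                self._copies.append(counts)
        counts[bin_index] += weight  # no lock needed: only this thread touches it

    def aggregate(self) -> Counter:
        """Sum all per-thread copies; call this after the worker threads finish."""
        total: Counter = Counter()
        with self._lock:
            for counts in self._copies:
                total.update(counts)
        return total


def fill_from_threads(n_threads: int, entries_per_thread: int, n_bins: int = 10) -> ThreadLocalHistogram:
    """Fill one histogram concurrently from several threads."""
    hist = ThreadLocalHistogram()

    def worker() -> None:
        for i in range(entries_per_thread):
            hist.fill(i % n_bins)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return hist
```

The user-facing call stays a plain `fill`, while the per-thread bookkeeping is hidden behind the interface. That is the design choice the slide describes: the policy is enforced in one place so that user code does not need to be rewritten.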

Transition to Multithreading in numbers
- Design of core components: ~6 man-months.
- So far ~90% of the DQM and VALIDATION code has been migrated.
  - It involved 400+ classes and related helper ones.
  - Several months (12+) of work.
- The threaded DQM version is regularly exercised in the THREADED Integration Build (run twice per day).
- Comparisons of the single-threaded and multithreaded versions give identical results.
- Performance numbers in Chris Jones' talk [Track2 14/4/15 15:00]:
  - 5-15% loss in CPU usage
  - Factor of 3-4 in memory reduction
  - Mostly not coming from DQM

Miscellanea and beyond Run2

Further Developments
- Data and Monte Carlo agreement is a key ingredient for many analyses.
  - Embed this functionality into the DQM Framework.
- DQMGUI has been constantly improved:
  - ROOT6 compatible
  - New APIs to fully expose all its content
  - Automatic histogram stacking
  - Improved performance

Conclusions

- The DQM framework proved to be extremely flexible and stable.
- DQM is used everywhere in CMS.
- The DQM Framework adapted and improved in the face of fundamental changes in the Online (DAQ2) and Offline (multithreading) environments.
- The central DQM tools have been constantly improved during Long Shutdown 1.
- The new DQM design and implementation has proved to be extremely effective since the beginning of the commissioning period in 2015.

BACKUP SLIDES

DQM in CMS: Core Components
- DQMStore: shared container that holds all monitoring information.
- MonitorElement:
  - ROOT objects
  - Quality information
  - Folder hierarchy
  - Flags
- DQMNet: layer to ship monitoring information over the network.
- DQMService: ties DQMStore and DQMNet together.

Multithreaded DQM

DQM GUI
A central component of the data quality monitoring system of the CMS Experiment is a web site for browsing data quality histograms. It guarantees authenticated worldwide access. It is a single customizable application capable of delivering visualization for all the DQM needs in all of CMS: for all subsystems, and for live data taking as much as for archives and offline workflows.

DQM GUI Architecture: C++, CherryPy, JS, render plugins

DQM GUI Developments during LS: APIs
The DQM GUI is capable of exposing its index via many APIs.

Performance Plot of DQM GUI
[Figure: HTTP Requests Served Per Day - Online Server Offsite. Logarithmic axis from 1 to 10,000,000; Nov '11 to Jan '13. Series: HTTP, Plots, JSON_API, JSON_SAMPLES, DATA_BROWSE, DATA_PUT, StripChart.]
[Figure: HTTP Response Time Daily Average - Offline Server. Logarithmic axis from 1 ms to 10,000,000 ms; Jun '11 to Jan '13. Series: HTTP, Plots, JSON_API, JSON_SAMPLES, DATA_BROWSE, DATA_PUT, StripChart.]

DQM (GUI) in numbers
Some interesting statistics:
- O(15K) lines of C++ code (web server accelerator, render engine, index manipulation), O(5K) lines of Python and O(4K) lines of JavaScript.

Quantity               Offline Server   RelVal Server   Online Server
SAMPLES                264,383          31,587          16,042
SOURCE FILES           303,580          31,744          387,298
DATASETS               2,068            18,379          1
CMSSW VERSIONS         --               90              --
UNIQUE OBJECTS (MEs)   913,174          1,649,086       346,448
STREAMERS              1                1               2
DISK SPACE (TB)        2.5              0.38            0.06

Data vs Monte Carlo automatic tools
- Data and Monte Carlo agreement is a key ingredient for many analyses.
- Embed this functionality into the DQM Framework:
  - Many MC samples are properly combined, with the event weight applied separately for each MC sample.
  - The scaling factor is the cross section, taken as a parameter from the configuration.
  - The number of produced events is taken from an existing ME.
  - Datasets are summed in a final, cumulative sample.
- DQMGUI has the capability to properly stack histograms from different samples.
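
A per-sample weight of this kind is commonly computed as cross section times integrated luminosity divided by the number of produced events. The sketch below assumes that convention. The slide itself only states that the cross section comes from the configuration and the number of produced events from an existing ME, so the luminosity factor, the function names and all numbers in the example are assumptions for illustration, not the DQM Framework code.

```python
from typing import Dict, List


def sample_weight(cross_section_pb: float, n_produced: float, luminosity_pb_inv: float) -> float:
    """Weight that normalizes one MC sample to the given integrated luminosity.

    cross_section_pb: from the configuration, as stated on the slide.
    n_produced: read from an existing ME in the real system; a plain number here.
    luminosity_pb_inv: assumed normalization target, not mentioned on the slide.
    """
    if n_produced <= 0:
        raise ValueError("number of produced events must be positive")
    return cross_section_pb * luminosity_pb_inv / n_produced


def cumulative_mc(samples: List[Dict], luminosity_pb_inv: float) -> List[float]:
    """Scale each sample by its own weight, then sum them into one cumulative sample."""
    total: List[float] = []
    for sample in samples:
        weight = sample_weight(sample["cross_section_pb"], sample["n_produced"], luminosity_pb_inv)
        scaled = [weight * content for content in sample["bins"]]
        if not total:
            total = scaled
        elif len(total) != len(scaled):
            raise ValueError("all samples must share the same binning")
        else:
            total = [a + b for a, b in zip(total, scaled)]
    return total


# Hypothetical samples: the values are made up for the example.
EXAMPLE_SAMPLES = [
    {"cross_section_pb": 10.0, "n_produced": 1000, "bins": [100.0, 200.0]},
    {"cross_section_pb": 2.0, "n_produced": 400, "bins": [40.0, 80.0]},
]
```

Applying the weight separately per sample, before the sum, is what lets samples of very different sizes be combined into one cumulative histogram.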

DQM in CMS: Requirements
- DQM should aggregate event level information that is sensitive to both detector and software (HLT) problems in one central place (web server).
- Fast turn-around and small latency. Update of info frequently during runs. Quality tests up to every lumi-section.
- Automatic alarms should notify about problems.
- Synoptic overview of detector status (front page).
- Shift level histograms.
- Expert histograms.
- The DQM information is key input to the creation of the good run list (aka JSON).
- The web server should be accessible everywhere, not just inside P5.
- DQM should be maintainable in a modular way by subsystems with fast updates outside regular release cycles.
- DQM needs to run in spy mode, in order to not interfere with the data taking.

DQM Offline
- Prompt reconstruction, calibrations, re-reconstruction, simulation and release validation all use the same processing model.
- Histograms (MEs) are created in jobs, saved in normal data files, harvested periodically, merged into full statistics with DAQ and DCS info, and finally tested for quality and summarised.
- Resulting histograms are uploaded to the GUI web server hosted at CERN and backed up to CASTOR/EOS.
- Final quality summary flags are stored in the conditions database for certification.
- Differences are in content and timing:
  - Tier-0 and Tier-1s re-determine detector status using full event statistics and full reconstruction, and add monitoring for physics objects; Tier-0 t ~ 48h, Tier-1s days+.
  - CAF: t hours to days on Al-Ca entities.
  - Validation verifies MC data.

CMSSW Multi-{Core,Thread}

Global
- Sees transitions on a global scale:
  - sees begin of Run and begin of Lumi when the source first reads them
  - sees end of Run and end of Lumi once all processing has finished for them
- Multiple transitions can be running concurrently:
  - two or more begin or end Runs (for different runs)
  - two or more begin or end Lumis
  - an end can be occurring while another begin is running
- Events are not seen globally.

Stream
- Processes transitions serially: begin run, begin lumi, events, end lumi, end run.
- Multiple streams can be running concurrently, each with its own events.

DQM in CMS: Subsystem-specific modules
[Figure: timeline of a cmsRun job (Begin Job, Begin Run, Begin Lumi, Event 1, Event 2, ..., End Lumi, Begin Lumi, ..., End Lumi, End Run, End Job), showing where a module books, fills and writes histograms. Run-based and lumi-based histograms (THn*) are held in the DQMStore.]
- Each module, in turn, sees a fixed sequence of events raised by the framework.
- Each module books the MonitorElements into the central DQMStore and receives back pointers to the newly created MEs.
- Events are serially processed and the MonitorElements are properly filled.
- At every End{Lumi,Run}, the corresponding MonitorElements are permanently written on disk.
- In the case in which the DQMNet is available, all modified MonitorElements will push their changes to the other side of the network channel (GUI's collector).
- The harvesting step is similar in concept, but will only see Run and Lumi transitions and will only have access to MonitorElements, not to the events.
- Quality tests are run in the harvesting in the end* transitions, depending on their configuration.
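
The book, fill and write sequence described on this slide can be sketched with a toy store and a toy module. Everything named below (`ToyStore`, `ToyModule`, `run_job`, the folder path) is hypothetical and stands in for the real DQMStore and CMSSW module interfaces, which are C++ and considerably richer. In particular, booking at begin of run and recording a label instead of writing a file are assumptions of this sketch.

```python
from typing import Dict, List


class ToyStore:
    """Hypothetical stand-in for DQMStore: maps a folder path to bin contents."""

    def __init__(self) -> None:
        self.elements: Dict[str, List[float]] = {}
        self.written: List[str] = []

    def book(self, path: str, n_bins: int) -> List[float]:
        """Create a histogram and hand back a reference, like the returned ME pointer."""
        self.elements[path] = [0.0] * n_bins
        return self.elements[path]

    def write(self, transition: str) -> None:
        """Record which transition triggered a write (no real file is produced)."""
        self.written.append(transition)


class ToyModule:
    """Hypothetical subsystem module that reacts to framework transitions."""

    def __init__(self, store: ToyStore) -> None:
        self.store = store
        self.hits: List[float] = []

    def begin_run(self, run: int) -> None:
        # Booking at begin of run is an assumption of this sketch.
        self.hits = self.store.book(f"Run{run}/Toy/hits", n_bins=4)

    def analyze(self, event: int) -> None:
        self.hits[event % 4] += 1.0  # fill through the reference obtained at booking

    def end_lumi(self, lumi: int) -> None:
        self.store.write(f"EndLumi {lumi}")

    def end_run(self, run: int) -> None:
        self.store.write(f"EndRun {run}")


def run_job(lumis: Dict[int, List[int]], run: int = 1) -> ToyStore:
    """Drive one module through the fixed sequence: run, lumis, events."""
    store = ToyStore()
    module = ToyModule(store)
    module.begin_run(run)
    for lumi, events in sorted(lumis.items()):
        for event in events:  # events are processed serially
            module.analyze(event)
        module.end_lumi(lumi)
    module.end_run(run)
    return store
```

Because the module keeps the reference returned by `book`, the store and the module always see the same histogram, which mirrors the pointer-based booking described above.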