The Experiment on the Effectiveness of ALICE Online-Offline Process Monitoring

Transcription

1 The Experiment on the Effectiveness of ALICE Online-Offline Process Monitoring Advisor: Dr. Phond Punchongharn Vasco Chibante Barroso Khanasin Yamnual King Mongkut's University of Technology Thonburi Faculty of Engineering Department of Computer Engineering Bangkok, Thailand

2 Outline - Introduction - Background and Literature Review - Proposed Work - Evaluation - Conclusion

3 Introduction CERN, the European Organization for Nuclear Research, is one of the world's largest and most respected centres for scientific research. ALICE (A Large Ion Collider Experiment) - Its major mission is to study the physics of strongly interacting matter, and in particular the properties of Quark-Gluon Plasma (QGP), using proton-proton, nucleus-nucleus and proton-nucleus collisions at high energies. ALICE O2 Computing - The resulting data throughput from the detector has been estimated to be greater than 1 TB/s for Pb-Pb events, roughly two orders of magnitude more than in Run 1. - The computing system therefore has to be upgraded. - In the design, the data volume is reduced by reconstructing the data in several steps synchronously with data taking.

4 Introduction (cont.) Control, Configuration and Monitoring (CCM) - These components act as a tightly-coupled entity with the role of supporting users and automating day-to-day operations. In this research, we focus on the Monitoring system.

5 Motivation To be effective, the monitoring system must be able to collect system status information from the O2 system and archive the monitoring data in persistent storage for access to historical records. - Examples of monitoring data are CPU load average and process memory usage. The Monitoring system should be able to trigger actions, automatically or through a human operator, when a condition is met. - An example of alarm and action triggering: when the CPU temperature becomes very high, the Monitoring system shuts down the machine. For the time being, in the O2 system, - the estimated number of nodes is 1,623 - the number of processes is estimated to be between 7 K and 70 K. The monitoring system should be able to collect, transport and eventually store high-frequency monitoring data at up to 544 kHz.

6 Research Problem Use the ELK stack to achieve a uniform and user-friendly monitoring interface as a single entry point to the O2 monitoring data, and provide a module that identifies unusual events for the purpose of triggering actions.

7 Scope of Work - A monitoring agent implementation that integrates with the ELK stack. - A practical action-triggering module.

8 Background and Literature Review - The upgrade of the ALICE Online-Offline - Control, Configuration and Monitoring (CCM) - Zabbix - MonALISA - Nagios - LEMON - ELK stack

9 The upgrade of the ALICE Online-Offline It will become a new common system called O2. It will restart collecting experimental data for the next run (Run3) in The estimated data throughput is expected to be greater than 1 TB/s for Pb-Pb events, which is approximately two orders of magnitude more than in the first run. The O2 system has been designed to support both online synchronous data reduction and asynchronous and iterative data processing.

10 The upgrade of the ALICE Online-Offline The O2 system has two main computing clusters - First Level Processors (FLPs) - Event Processing Nodes (EPNs), plus other necessary dedicated nodes. To process data online while data taking is in progress, the O2 system requires components to control the computing facility. The ALICE computing working group has introduced the Control, Configuration and Monitoring (CCM) components for this purpose.

11 CCM The Control, Configuration and Monitoring (CCM) components of the ALICE O2 system will act as a tightly-coupled entity with the role of supporting users and automatizing day-to-day operations. The Control system is responsible for coordinating all the O2 processes according to system status and monitoring data. The Configuration system ensures that both the application and environment parameters are properly set. Finally, the Monitoring system gathers information from the O2 system with the aim of identifying unusual patterns and raising alarms.

12 CCM The CCM systems will also need to interface with other ALICE subsystems such as the ALICE Trigger system, Detector Control System (DCS) and Storage systems in order to send commands, transmit configuration parameters, submit jobs, and receive status and monitoring data. In this research, we focus mainly on one of the CCM components, the Monitoring component.

13 The Monitoring system The Monitoring system gathers information from the O2 components and processes in order to assess the status and health of the entities in quasi real time. It raises an alarm when it finds unusual patterns. It should also be able to aggregate monitoring data to provide high-level views of the entire system, archive relevant metrics for long-term analysis and forensic investigation, and reduce the volume of data received continuously by the subscribers. The Monitoring system provides an application programming interface (API) that allows any software component to publish heartbeat and explicit monitoring data to a common data store. The same data store also holds periodic reports of operating system views of the main processes and other critical services, as well as monitoring data collected from the infrastructure, such as server health and utilization monitoring and fabric monitoring data. The API also allows queries on current monitoring values or on historical data. Specifically, the Control system will use it to assess the health of the system in general and trigger actions accordingly.
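
The publish and query roles of such an API can be illustrated with a minimal Python sketch. It is not the O2 Monitoring API: the class `MonitoringStore`, its method names, and the host and metric names are all hypothetical, and an in-memory dictionary stands in for the common data store.

```python
import time
from collections import defaultdict


class MonitoringStore:
    """Toy in-memory stand-in for the common monitoring data store (hypothetical)."""

    def __init__(self):
        # (source, metric) -> list of (timestamp, value), kept in arrival order
        self._series = defaultdict(list)

    def publish(self, source, metric, value, timestamp=None):
        """Record one sample; a heartbeat is simply a metric named 'heartbeat'."""
        ts = time.time() if timestamp is None else timestamp
        self._series[(source, metric)].append((ts, value))

    def current(self, source, metric):
        """Return the most recently published value, or None if there is none."""
        samples = self._series.get((source, metric))
        return samples[-1][1] if samples else None

    def history(self, source, metric, start, end):
        """Return all (timestamp, value) samples with start <= timestamp <= end."""
        samples = self._series.get((source, metric), [])
        return [(ts, v) for ts, v in samples if start <= ts <= end]


# Hypothetical usage: one node publishes a heartbeat and two load samples.
store = MonitoringStore()
store.publish("flp-001", "heartbeat", 1, timestamp=100.0)
store.publish("flp-001", "cpu_load_avg", 0.42, timestamp=100.0)
store.publish("flp-001", "cpu_load_avg", 0.87, timestamp=110.0)
```

A consumer such as the Control system would then call `current` for the latest value or `history` for a time window.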

14 Existing Monitoring tools ZABBIX - It is used for system performance monitoring of the ALICE Data Acquisition (DAQ) system. MonALISA - It provides grid-level monitoring of the ALICE grid environment. - It is used to collect monitoring information of jobs (CPU resources), storage servers (disk, tape), data transfers (network), network fabric, and management software (infrastructure).

15 Existing Monitoring tools NAGIOS - It is used as a grid infrastructure monitoring system. - It does not scale to thousands of hosts and tens of thousands of services. LEMON - The LHC Era Monitoring (LEMON) system is used at CERN. - It collects monitoring information on servers, network equipment, associated software, and additional environment and facilities data for the CERN computer centre.

16 ELK stack The ELK stack comprises Elasticsearch (ES), Logstash, and Kibana, and is generally referred to as the Elasticsearch ecosystem.

17 ELK stack - Elasticsearch

18 Elasticsearch (ES) An open source search and analytics engine built on top of the Apache Lucene information retrieval library. It is a NoSQL database that is scalable and distributed (through shards and replicas). Every entry is stored as a schema-free JSON document, and all fields can be indexed and used in a single query. It allows full-text search on unstructured data through a RESTful API using JSON over HTTP.
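
As a rough illustration of "JSON over HTTP", the following Python sketch builds, but does not send, one request that stores a monitoring document and one that searches for it. The server address, the index name `o2-monitoring`, and the document fields are assumptions made for this example. The `_doc` and `_search` endpoints are those of recent Elasticsearch releases; older releases used a type name in the path instead of `_doc`.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address; 9200 is the default Elasticsearch HTTP port
INDEX = "o2-monitoring"           # hypothetical index name


def build_index_request(doc):
    """Build (but do not send) a request that stores one JSON document."""
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_search_request(host):
    """Build (but do not send) a query for the documents of one host."""
    query = {"query": {"match": {"host": host}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical monitoring document; no schema has to be declared beforehand.
sample_doc = {
    "@timestamp": "2016-01-01T00:00:00Z",
    "host": "epn-0001",
    "metric": "cpu_load_avg",
    "value": 0.42,
}
index_request = build_index_request(sample_doc)
search_request = build_search_request("epn-0001")
```

Passing either request to `urllib.request.urlopen` would send it to a running server.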

19 ELK stack - Logstash

20 Logstash An open source tool used to receive, process, and output logs. It can be easily configured via plugins for input, output and data filters, and provides a powerful pipeline for storing, querying, and analyzing logs. As ES acts as the backend data store and Kibana acts as the front-end web app, Logstash becomes the workhorse that sends data to ES.

21 ELK stack - Kibana

22 Kibana An open source analytics and visualization platform designed to work with ES. It can be used to search, view, and interact with the ES data. In addition, it provides advanced data analysis and visualizes data in a variety of charts, tables, and maps. The software can be hosted on any web server and can be extended to meet specific needs. With a few mouse clicks, we can create custom interactive dashboards without any prior GUI programming knowledge. Kibana provides a set of useful pre-defined plot types such as pies, histograms and trends.

23 Proposed Work
Figure: Overall System Design
Table: Estimated number of nodes
Function                 # of nodes
FLPs                     250
EPNs                     1250
DB servers               5
Control servers          6
Configuration servers    6
Monitoring servers       6
QA/DQM servers           30
Calibration servers      30
Storage servers          10
Network servers          5
Operator terminals       25
Total                    1623

24 Monitoring System - Multiple Elasticsearch master and data nodes - A single visualization server - A Logstash instance on the EPNs and other desired nodes. The design focuses on the infrastructure, hosts and processes while allowing explicit application parameters to be sent from any entity in the system. Figure: Monitoring System Design

25 Elasticsearch server Because Elasticsearch acts as a NoSQL data store and is distributed and scalable, the number of nodes has not been decided yet. It will, however, be more than a single node. The Elasticsearch servers should handle monitoring frequencies between 60 kHz and 544 kHz, according to the estimated number of processes and hosts. All the data in Elasticsearch can be transferred to persistent storage for archival and further analysis.

26 Visualization Server To visualize such a large volume of data and retrieve values from the Elasticsearch servers, we need a robust web-based graphical user interface (GUI). Fortunately, the stack includes Kibana, a web app architected to work with Elasticsearch. With a small amount of configuration, essentially defining the source IP address or hostname, it can visualize monitoring data both as text and as graphs. Kibana can also provide some level of aggregation on the monitoring data, depending on what the administrators are interested in.

27 Logstash Instances on Clients With Logstash, clients can transport their own useful monitoring data to the Elasticsearch servers. The configuration is needed only once, after installation. The output of Logstash will point to one of the Elasticsearch servers.

28 Monitoring Agent The agent will be implemented in C++. On every monitored node, a monitoring agent is launched and retrieves monitoring values. It then stores the values in log files, and the Logstash instance reads from those log files as configured. The expected metrics are: Host monitoring: CPU (10 metrics), Networking (4 metrics / interface), Memory (10 metrics), Processes status (5 metrics), Sockets status (10 metrics), Disk status (10 metrics / device). Process monitoring (from the system point of view: CPU, memory profile, handles): 10 metrics / process.
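
The agent-to-log-file step can be sketched as follows. The slides state that the real agent will be written in C++; this short Python version only illustrates the data flow. The record layout (one JSON object per line), the field names and the file name `o2-agent.log` are assumptions, chosen because a line-oriented format is straightforward for a Logstash file input to pick up.

```python
import json
import os
import tempfile
import time


def format_record(host, metric, value, timestamp=None):
    """Serialize one sample as a single JSON line (assumed log format)."""
    ts = time.time() if timestamp is None else timestamp
    record = {"timestamp": ts, "host": host, "metric": metric, "value": value}
    return json.dumps(record, sort_keys=True)


def collect_host_metrics(host):
    """Collect a few host metrics; only the load averages are read here."""
    records = []
    if hasattr(os, "getloadavg"):  # not available on every platform
        try:
            load1, load5, load15 = os.getloadavg()
        except OSError:
            return records  # load average could not be obtained
        records.append(format_record(host, "cpu_load_1min", load1))
        records.append(format_record(host, "cpu_load_5min", load5))
        records.append(format_record(host, "cpu_load_15min", load15))
    return records


def append_records(path, records):
    """Append records to the log file that Logstash is configured to read."""
    with open(path, "a", encoding="utf-8") as log_file:
        for record in records:
            log_file.write(record + "\n")


# Hypothetical run: one fixed sample plus whatever the host reports.
log_path = os.path.join(tempfile.mkdtemp(), "o2-agent.log")
append_records(log_path, [format_record("flp-001", "cpu_load_1min", 0.42, timestamp=100.0)])
append_records(log_path, collect_host_metrics("flp-001"))
```

Appending rather than rewriting the file matters here, because a reader that follows the file only sees new lines.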

29 Action triggering Because some events may be critical for data taking, the action triggering module should be able to deliver specific alarms to the users who need them. It can distribute alarms via the GUI, and it should inform other subsystems about the events related to them.
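
A threshold-based trigger of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the module proposed in the slides: the rule structure, the metric name `cpu_temperature_c`, the thresholds of 80 and 95 degrees, and the action names are all hypothetical. It follows the CPU temperature example given in the Motivation slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str       # name of the monitored metric
    threshold: float  # alarm is raised when the value exceeds this
    severity: str     # label passed on to users or other subsystems
    action: str       # name of the action to trigger (hypothetical)


# Hypothetical rules: warn at 80 degrees C, shut the machine down at 95.
RULES = [
    Rule("cpu_temperature_c", 80.0, "warning", "notify_administrators"),
    Rule("cpu_temperature_c", 95.0, "critical", "shutdown_machine"),
]


def evaluate(host, sample, rules=RULES):
    """Return one alarm per rule whose threshold the sample exceeds."""
    alarms = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            alarms.append(
                {
                    "host": host,
                    "metric": rule.metric,
                    "value": value,
                    "severity": rule.severity,
                    "action": rule.action,
                }
            )
    return alarms


# Hypothetical sample from one node.
alarms = evaluate("epn-0001", {"cpu_temperature_c": 97.0})
```

Returning alarms as plain data keeps delivery separate from detection, so the same result could be shown in a GUI or forwarded to another subsystem.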

30 Evaluation - Data collection and archival - Action triggering. The Monitoring system should be able to handle between ~60 kHz and ~544 kHz. Both raw and aggregated monitoring data should be visible to anyone who is interested, especially the administrators.
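
How such a rate depends on the number of hosts and processes can be sketched with simple arithmetic. The per-host and per-process metric counts below are taken from the Monitoring Agent slide. The number of network interfaces and disks per host and the sampling interval are not given in the slides, so the values used here are assumptions, and the result is illustrative only. It is not meant to reproduce the 60 kHz to 544 kHz figures.

```python
METRICS_PER_PROCESS = 10  # from the Monitoring Agent slide


def metrics_per_host(interfaces, disks):
    """Host metrics: CPU 10, memory 10, process status 5, socket status 10,
    plus 4 per network interface and 10 per disk device."""
    fixed = 10 + 10 + 5 + 10
    return fixed + 4 * interfaces + 10 * disks


def metric_rate_hz(hosts, processes, interfaces, disks, interval_s):
    """Metric values produced per second if every metric is sampled once per interval."""
    per_sample = hosts * metrics_per_host(interfaces, disks) + processes * METRICS_PER_PROCESS
    return per_sample / interval_s


# Assumptions (not from the slides): 1 interface and 1 disk per host, 1 s interval.
low = metric_rate_hz(hosts=1623, processes=7_000, interfaces=1, disks=1, interval_s=1.0)
high = metric_rate_hz(hosts=1623, processes=70_000, interfaces=1, disks=1, interval_s=1.0)
```

Under these assumptions the process metrics dominate the total, which suggests the sampling interval and the process count are the main levers on the load the data store must absorb.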

31 Conclusion Our Monitoring system for the ALICE O2 computing system is designed on the server-agent concept. By adopting the ELK stack: - A Logstash instance on each individual node reads the monitoring data from files and transports it to the Elasticsearch servers. - Kibana provides a simple interface to visualize the measurements from both current and historical records. Finally, the action triggering module raises an alarm to the administrators or shifters when it detects an unusual pattern in the O2 system.

32 Thank you

33 Q&A
