A Year of HTCondor Monitoring. Lincoln Bryant, Suchandra Thapa



A Year of HTCondor Monitoring. Lincoln Bryant, Suchandra Thapa. HTCondor Week 2015, May 21, 2015.

Analytics vs. Operations

Two parallel tracks in mind:
- Operations
- Analytics

Operations needs to:
- Observe trends.
- Be alerted and respond to problems.

Analytics needs to:
- Store and retrieve the full ClassAds of every job.
- Perform deep queries over the entire history of jobs.

Motivations

For OSG Connect, we want to:
- track HTCondor status
- discover job profiles of various users to determine what resources users need
- determine when user jobs are failing and help users correct failures
- create dashboards to provide high-level overviews
- open our monitoring up to end-users
- be alerted when there are problems

Existing tools like Cycleserver, Cacti, Ganglia, etc. did not adequately cover our use case.

Operations

Graphite

Real-time graphing for time series data. Open source, developed by Orbitz.

Three main components:
- Whisper - data format, replacement for RRD
- Carbon - listener daemon
- Front-end - web interface for metrics

Dead simple protocol: open a socket, fire off metrics in the form of:

    path.to.metric <value> <timestamp>

Graphite home page: http://graphite.wikidot.com/

Sending HTCondor stats

To try it out, you can just parse the output of condor_q into the desired format. Then, simply use netcat to send it to the Graphite server:

    #!/bin/bash
    metric="htcondor.running"
    value=$(condor_q | grep R | wc -l)
    timestamp=$(date +%s)
    echo "$metric $value $timestamp" | nc \
        graphite.yourdomain.edu 2003

A simple first metric Run the script with cron:

Problems with parsing

Parsing condor_q output is a heavyweight and potentially fragile operation:
- Especially if you do it once a minute
- Be prepared for gaps in the data if your schedd is super busy

What to do then? Ask the daemons directly with the Python bindings!

Collecting Collector stats via Python

Here we ask the collector for slots in the Claimed state and sum them up:

    import time
    import classad, htcondor

    coll = htcondor.Collector("htcondor.domain.edu")
    slotstate = coll.query(htcondor.AdTypes.Startd, "true",
                           ['Name', 'JobId', 'State', 'RemoteOwner',
                            'COLLECTOR_HOST_STRING'])
    slot_claimed = 0
    for slot in slotstate:
        if slot['State'] == "Claimed":
            slot_claimed += 1
    timestamp = int(time.time())
    print "condor.claimed " + str(slot_claimed) + " " + str(timestamp)
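Instead of printing the metric line, the poll can ship it straight to Carbon over a socket, just like the netcat example. A minimal sketch; the Graphite hostname is a placeholder and port 2003 is Carbon's usual plaintext-protocol port:

```python
import socket


def format_metric(path, value, timestamp):
    # One metric line in Graphite's plaintext protocol:
    # path.to.metric <value> <timestamp>
    return "%s %s %d" % (path, value, timestamp)


def send_metrics(lines, host="graphite.yourdomain.edu", port=2003):
    # Open a socket to Carbon and fire off newline-terminated metrics.
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(("\n".join(lines) + "\n").encode("ascii"))
    finally:
        sock.close()
```

At the end of the collector poll this would be called as, e.g., `send_metrics([format_metric("condor.claimed", slot_claimed, timestamp)])`.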

Sample HTCondor Collector summary Fire off a cron & come back to a nice plot:

Grafana

Graphite is nice, but static PNGs are so Web 1.0. Fortunately, Graphite can export raw JSON instead. Grafana is an overlay for Graphite that renders the JSON data using flot:
- Nice HTML/JavaScript-based graphs
- Admins can quickly assemble dashboards, carousels, etc.
- Saved dashboards are backed by a NoSQL database

All stored Graphite metrics should work out of the box.

Grafana home page: http://grafana.org/
flot home page: http://www.flotcharts.org/
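The same raw JSON is easy to consume from your own scripts. A sketch of parsing the output of Graphite's render endpoint (`/render?target=...&format=json`); the series name used below is just an example:

```python
import json


def latest_values(render_json):
    # Graphite's JSON render output is a list of series, each with a
    # "target" name and "datapoints" as [value, timestamp] pairs
    # (value is null wherever nothing was recorded).
    latest = {}
    for series in json.loads(render_json):
        points = [p for p in series["datapoints"] if p[0] is not None]
        if points:
            value, _ts = max(points, key=lambda p: p[1])
            latest[series["target"]] = value
    return latest
```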

Sample dashboard

Sample dashboard

An example trend

A user's jobs were rapidly flipping between idle and running. Why? Turns out to be a problematic job with an aggressive periodic release:

    periodic_release = ((CurrentTime - EnteredCurrentStatus) > 60)

(Credit to Mats Rynge for pointing this out.)

Active monitoring with Jenkins

Nominally a continuous integration tool for building software, but easily configured to submit simple Condor jobs instead. Behaves like a real user:
- Grabs the latest functional tests via git
- Runs condor_submit for a simple job
- Checks for correct output, notifies upon failure

Plethora of integrations for notifying sysops of problems: email, IRC, XMPP, SMS, Slack, etc.
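The body of such a Jenkins job can be sketched in a few lines. The submit/log/output file names and the expected marker string below are hypothetical, not from the talk; condor_wait blocks on the job's user log until the job leaves the queue:

```python
import subprocess

EXPECTED = "hello from the worker node"  # hypothetical marker string


def output_ok(text, expected=EXPECTED):
    # The pass/fail check Jenkins uses to decide whether to notify.
    return expected in text


def run_functional_test(submit_file="test_job.sub",
                        log_file="test_job.log",
                        out_file="test_job.out"):
    # Submit the canned test job, block until it completes,
    # then verify it produced the expected output.
    subprocess.check_call(["condor_submit", submit_file])
    subprocess.check_call(["condor_wait", log_file])
    with open(out_file) as f:
        return output_ok(f.read())
```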

Jenkins monitoring samples Dashboard gives a reassuring all-clear: Slack integration logs & notifies support team of problems:

Operations - Lessons learned

- Using the HTCondor Python bindings for monitoring is at least as easy as scraping condor_{q,status}
- If you plan to have a lot of metrics, the sooner you move to SSDs, the better
- Weird oscillatory patterns, sudden drops in running jobs, and huge spikes in idle jobs can all be indicative of problems
- Continuous active testing plus an alerting infrastructure is key for catching problems before end-users do

Analytics

In the beginning...

We started with a summer project where students would visualize HTCondor job data. To speed things up, we wrote a small Python script (~50 lines) that queried HTCondor for history information and added any new records to a MongoDB server:
- Intended as an existing data source for the students
- Ended up being used for much longer
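The core of such a history-import script might look roughly like this; the GlobalJobId deduplication key and the database/collection names are assumptions, not details from the talk:

```python
def new_records(records, seen_ids):
    # Keep only job records we have not stored yet, keyed on the
    # (assumed) GlobalJobId attribute, and remember their ids.
    fresh = [r for r in records if r["GlobalJobId"] not in seen_ids]
    seen_ids.update(r["GlobalJobId"] for r in fresh)
    return fresh

# With pymongo, the insert side would look roughly like:
#   from pymongo import MongoClient
#   jobs = MongoClient("mongodb.domain.edu").condor.history
#   batch = new_records(poll_condor_history(), seen_ids)
#   if batch:
#       jobs.insert_many(batch)
```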

Initial data visualization efforts

Had a few students working on visualizing the data over the last summer:
- Generated a few visualizations using MongoDB and other sources
- Tried importing data from MongoDB into Elasticsearch and viewing it with Kibana
- Used a few visualizations for a while, but eventually stopped due to the maintenance required

Created a homebrew system using MongoDB, Python, Highcharts, and CherryPy.

Current setup

- Probes to collect condor history from the log file and to check the schedd every minute
- Redis server providing pub/sub channels for the probes
- Logstash to follow the Redis channels and insert data into Elasticsearch
- Elasticsearch cluster for storage and queries
- Kibana for user and operational dashboards; RStudio/Python scripts for more complicated analytics

Indexing job history information

A Python script polls the history logs periodically for new entries and publishes the classads to a channel on the Redis server, where they are read by Logstash. Due to the size of the classads and because Elasticsearch only works on data in memory, data goes into a new index each month.
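The probe side of this pipeline can be sketched as below. The monthly index naming scheme and the Redis channel name are assumptions; the publisher is passed in as a callable so that a real Redis client (e.g. redis-py's publish) can be bound to it:

```python
import datetime
import json


def monthly_index(ts, prefix="condor-history"):
    # Pick the per-month Elasticsearch index for a record's timestamp,
    # e.g. condor-history-2015.05 (the naming scheme is an assumption).
    d = datetime.datetime.fromtimestamp(ts, datetime.timezone.utc)
    return "%s-%04d.%02d" % (prefix, d.year, d.month)


def publish_classads(classads, publish):
    # Serialize each classad (as a plain dict) to JSON and hand it to
    # the publisher callable for the Redis channel.
    for ad in classads:
        publish(json.dumps(ad, sort_keys=True))

# Wiring it up with redis-py might look like:
#   r = redis.StrictRedis("redis.domain.edu")
#   publish_classads(ads, lambda msg: r.publish("condor-history", msg))
```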

Indexing schedd data

A Python script is run every minute by a cronjob and collects the classads for all jobs. The complete set of job classads is put into an ES index for that week. The script also inserts a record with the number of jobs in each state into another ES index.
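The per-state record is easy to build from the polled classads. A small sketch, treating the classads as plain dicts; the JobStatus codes are the standard HTCondor values:

```python
from collections import Counter

# Standard HTCondor JobStatus codes.
JOB_STATUS = {1: "idle", 2: "running", 3: "removed",
              4: "completed", 5: "held", 6: "transferring_output"}


def state_counts(job_ads):
    # Summarize one schedd poll into the small per-state record that
    # goes into the separate ES index.
    return dict(Counter(JOB_STATUS.get(ad.get("JobStatus"), "unknown")
                        for ad in job_ads))
```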

Querying/Visualizing information

All of this information is good, but we need a way of querying and visualizing it. Luckily, Elasticsearch integrates with Kibana, which provides a web interface for querying and visualization.

Kibana Overview (annotated screenshot): query field, hits over time, sampling of documents found, time range selector.

Plotting/Dashboards with Kibana

Kibana can also generate plots of the data, as well as dashboards. Plots and dashboards can be exported!

Some (potentially interesting) plots

Queries/plots were developed in Kibana; the data was exported as a CSV file and plotted using RStudio/ggplot.

Average number of job starts over time

94% of jobs succeed on their first attempt. Note: most jobs use periodic hold and periodic release ClassAd expressions to retry failed jobs, so invalid submissions may result in multiple restarts, inflating this number.

Average job duration on compute node

Plot of the average job duration on compute nodes: the majority of jobs complete within an hour, with a few projects having outliers (mean: 1050s, median: 657s).

Average bytes transferred per job using HTCondor

Most jobs transfer significantly less than 1 GB through HTCondor (mean: 214 MB, median: 71 MB). Note: this does not take into account data transferred by other methods (e.g. wget).

Memory usage

Most jobs use less than 2 GB of memory, with the proteomics and snoplus projects having the highest average memory utilization (mean: 378 MB, median: 118 MB).

Other uses Analytics framework is not just for dashboards and pretty plots Can also be used to troubleshoot issues

Troubleshooting a job

In November 2014, there was a spike in the number of restarts. We would like to investigate this and see if we can determine the cause and a fix.

Troubleshooting part 2

Select the time range in question and split out the projects with the most restarts: the algdock project seems to be responsible for most of them.

Troubleshooting concluded

Next, we go to the Discover tab in Kibana, search for records in the relevant time frame, and look at the classads. It quickly becomes apparent that the problem is due to a missing output file that results in the job being held, combined with a PeriodicRelease for the job.

Future directions

- Update the schedd probe to use the LogReader API to get classad updates rather than querying the schedd for all classads
- More and better analytics:
  - Explore the data more and pull out analytics that are useful for operational needs
  - Update user dashboards to present information that users are interested in

Links

GitHub repo - https://github.com/dhtc-Tools/logstash-confs/tree/master/condor

Questions? Comments?