Monitoring HTCondor with Ganglia

Ganglia Overview
- Scalable distributed monitoring for HPC clusters
- Two daemons
  - gmond: runs on every host; collects and sends metrics
  - gmetad: runs on a single host; persists metrics from the local gmond in RRD
- Web frontend
  - Presents graphs from the persisted data

Why Ganglia?
- Widely used monitoring system for clusters and grids
- Easy to add new metrics
- Can create custom graphs

Running condor_gangliad
- condor_gangliad runs on a single host
- Gathers daemon ClassAds from the Collector
- Publishes metrics to ganglia with host spoofing
- Can be on any host
  - May be co-located with condor_collector or gmetad
  - Consider network traffic

Put Them Together

Possible Deployments
- Ganglia is already used for monitoring
  - Start condor_gangliad on the gmetad host: least configuration
  - Start condor_gangliad on the Central Manager: saves network traffic
- Ganglia is not in use for monitoring
  - Set up a dedicated host to run ganglia and condor_gangliad
  - Generates graphs for web pages on demand

Ganglia Interface
- Uses the gmetric method to add metrics to ganglia
  - Uses the shared library on the system to send updates: fast and efficient
  - Falls back to using the gmetric command: much slower
- Uses gstat to determine which hosts are already monitored by ganglia
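For orientation, the fallback path amounts to one gmetric invocation per metric, with the originating host spoofed. The following is only a sketch in Python, assuming a gmetric binary that accepts --name, --value, --type, --units and --spoof (given as IP:hostname). The helper name, the host name and the address are hypothetical, and the command is built but never run.

```python
def gmetric_command(name, value, units, host, ip, metric_type="float"):
    """Build (but do not run) a gmetric command line that spoofs `host`."""
    return [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
        f"--units={units}",
        # Spoofing makes the metric appear under `host`, not the sending host.
        f"--spoof={ip}:{host}",
    ]


# Hypothetical metric value, host name and address, for illustration only.
cmd = gmetric_command("JobsSubmitted", 42, "jobs",
                      host="submit-1.example.org", ip="192.0.2.10",
                      metric_type="uint32")
print(" ".join(cmd))
```

Launching a separate process for every metric in every update is presumably why this path is much slower than calling the shared library directly.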

Configuration Macros
- GANGLIA_GSTAT_COMMAND
  - Default queries localhost (change it if the master gmond runs elsewhere)
  - gstat --all --mpifile --gmond_ip=localhost --gmond_port=8649
- GANGLIA_SEND_DATA_FOR_ALL_HOSTS
  - Set to true to send data for hosts not currently in ganglia
- GANGLIAD_VERBOSITY
  - Defaults to 0; set higher to publish more metrics

Running condor_gangliad
- Add to DAEMON_LIST
  - DAEMON_LIST = $(DAEMON_LIST), GANGLIAD
- Check GangliadLog for gmetric integration
  - Look for the libganglia load message
  - Library has been stable over many releases
  - May have to specify the path to the library
  - If it falls back to the gmetric command, look closely at timing

Log Snippet
04/24/14 08:05:43 Testing gmetric
04/24/14 08:05:43 Loading libganglia /usr/lib64/libganglia-3.1.7.so.0.0.0
04/24/14 08:05:43 Will use libganglia to interact with ganglia.
04/24/14 08:06:03 Starting update...
04/24/14 08:06:03 Ganglia is monitoring 1 hosts
04/24/14 08:06:10 Got 7687 daemon ads
04/24/14 08:06:14 Ganglia metrics sent: 1858
04/24/14 08:06:14 Heartbeats sent: 80
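One way to keep an eye on update timing is to pull the counts and the elapsed time out of GangliadLog. The sketch below is a rough Python illustration that assumes the log keeps exactly the message wording shown in the snippet above; the function name and the dictionary keys are hypothetical.

```python
import re
from datetime import datetime

LOG = """\
04/24/14 08:06:03 Starting update...
04/24/14 08:06:03 Ganglia is monitoring 1 hosts
04/24/14 08:06:10 Got 7687 daemon ads
04/24/14 08:06:14 Ganglia metrics sent: 1858
04/24/14 08:06:14 Heartbeats sent: 80
"""


def summarize(log_text):
    """Return the counts reported in one update and its duration in seconds."""
    start = end = None
    counts = {}
    for line in log_text.splitlines():
        m = re.match(r"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) (.*)", line)
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        msg = m.group(2)
        if msg.startswith("Starting update"):
            start = stamp
        elif (n := re.match(r"Got (\d+) daemon ads", msg)):
            counts["daemon_ads"] = int(n.group(1))
        elif (n := re.match(r"Ganglia metrics sent: (\d+)", msg)):
            counts["metrics_sent"] = int(n.group(1))
            end = stamp  # treat the "metrics sent" line as the end of the update
        elif (n := re.match(r"Heartbeats sent: (\d+)", msg)):
            counts["heartbeats"] = int(n.group(1))
    counts["seconds"] = (end - start).total_seconds() if start and end else None
    return counts


summary = summarize(LOG)
print(summary)
```

For the snippet above this reports an 11-second update, most of it spent fetching the 7687 daemon ads.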

Limit Data
- GANGLIAD_PER_EXECUTE_NODE_METRICS
  - Set to false for a large pool
- Use a requirements expression to limit the data fetched
  GANGLIAD_REQUIREMENTS = \
    Machine == "cm.chtc.wisc.edu" || \
    Machine == "submit-1.chtc.wisc.edu" || \
    Machine == "submit-2.chtc.wisc.edu" || \
    Machine == "submit-3.chtc.wisc.edu"
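With more than a handful of hosts, the requirements expression is easier to generate than to type. The sketch below is a small Python helper (its name is hypothetical) that joins one Machine clause per host. It assumes the clauses are meant to be combined with the ClassAd OR operator (||), since a single ad can match only one machine name.

```python
def gangliad_requirements(hosts):
    """Join one Machine == "<host>" clause per host with ClassAd OR (||)."""
    clauses = [f'Machine == "{h}"' for h in hosts]
    return "GANGLIAD_REQUIREMENTS = " + " || ".join(clauses)


line = gangliad_requirements(["cm.chtc.wisc.edu", "submit-1.chtc.wisc.edu"])
print(line)
```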

Metrics to Track
- Described in /etc/condor/ganglia.d/
  - Default set provided
- Expressed as ClassAds
  - Name: unique metric name used by ganglia
  - Value: ClassAd expression; defaults to the daemon ad attribute named by Name

Minimal Example
[
  Name = "JobsSubmitted";
  Desc = "Number of jobs submitted";
  Units = "jobs";
  TargetType = "Scheduler";
]

Simple Example
[
  Name = strcat(mytype,"daemoncoredutycycle");
  Value = RecentDaemonCoreDutyCycle;
  Desc = "Recent fraction of busy time in the daemon event loop";
  Scale = 100;
  Units = "%";
  TargetType = "Scheduler,Negotiator,Machine_slot1";
]
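To make the roles of Name, Value, Scale and TargetType concrete, here is a rough Python model of how one metric definition might be applied to one daemon ad. It is an illustration only and not condor_gangliad's implementation: plain dicts stand in for ClassAds, a callable stands in for the strcat expression, TargetType is matched against MyType by simple string comparison, and all sample values are invented.

```python
def evaluate_metric(metric, ad):
    """Return (metric_name, published_value), or None if the ad is not targeted."""
    if ad["MyType"] not in metric["TargetType"].split(","):
        return None
    name = metric["Name"](ad) if callable(metric["Name"]) else metric["Name"]
    # Value defaults to the ad attribute that has the same name as the metric.
    attr = metric.get("Value", name)
    # Scale multiplies the raw value before it is published.
    return name, ad[attr] * metric.get("Scale", 1)


minimal = {"Name": "JobsSubmitted", "TargetType": "Scheduler"}
duty_cycle = {
    "Name": lambda ad: ad["MyType"] + "daemoncoredutycycle",  # stands in for strcat
    "Value": "RecentDaemonCoreDutyCycle",
    "Scale": 100,
    "TargetType": "Scheduler,Negotiator,Machine_slot1",
}

# Invented sample ad for illustration.
schedd_ad = {"MyType": "Scheduler", "JobsSubmitted": 12,
             "RecentDaemonCoreDutyCycle": 0.25}

print(evaluate_metric(minimal, schedd_ad))
print(evaluate_metric(duty_cycle, schedd_ad))
```

In this model a duty cycle of 0.25 is published as 25, which matches the "%" unit in the definition.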

Aggregate Metrics
- Can aggregate metrics over the entire pool
  - Sum: running jobs over the pool
  - Min and Max: space available
  - Average
- Aggregates appear in the HTCondor Pool group on the central manager

Aggregate Example
[
  Name = "TotalJobAds";
  Desc = "Number of jobs currently in this schedd's queue";
  Units = "jobs";
  TargetType = "Scheduler";
]
[
  Aggregate = "SUM";
  Name = "Jobs in Pool";
  Value = TotalJobAds;
  Desc = "Number of jobs currently in schedds reporting to this pool";
  Units = "jobs";
  TargetType = "Scheduler";
]
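The difference between the per-schedd metric and the pool-wide aggregate can be sketched in a few lines of Python. This is only a model, with dicts standing in for schedd ads and invented queue sizes. "SUM" comes from the example above, while the "MIN", "MAX" and "AVG" labels are assumptions made for this sketch.

```python
def aggregate(ads, attr, how):
    """Combine one attribute across all ads that define it."""
    values = [ad[attr] for ad in ads if attr in ad]
    if not values:
        return None
    if how == "SUM":
        return sum(values)
    if how == "MIN":
        return min(values)
    if how == "MAX":
        return max(values)
    if how == "AVG":
        return sum(values) / len(values)
    raise ValueError(f"unknown aggregate: {how}")


# Invented queue sizes for three schedds.
schedd_ads = [{"TotalJobAds": 120}, {"TotalJobAds": 30}, {"TotalJobAds": 60}]

jobs_in_pool = aggregate(schedd_ads, "TotalJobAds", "SUM")
print(jobs_in_pool)
```

Each schedd still publishes its own TotalJobAds; the aggregate adds a single pool-level value alongside them.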

Scaling Example
[
  Name = strcat(mytype,"monitorselfresidentsetsize");
  Value = MonitorSelfResidentSetSize;
  Verbosity = 1;
  Desc = "RAM allocated to this daemon";
  Units = "bytes";
  Scale = 1024;
  Type = "float";
  TargetType = "Scheduler,Negotiator,Machine_slot1";
]

Other Attributes
- Title = graph title (defaults to Name)
- Regex = for dynamic metrics (users)
- Type = chosen automatically from the value's type
  - Integers are coerced to floats if scaled or large
- Group = group on the web page

Future Work
- Composite graphs
  - For example, I/O load and throughput
  - Better able to draw conclusions
- Graph slot states
- Determine which metrics are most useful

Live Demo
http://timt.chtc.wisc.edu/ganglia
http://cm.batlab.org/ganglia