Nagios and Cloud Computing

Presentation by William Leibzon (william@leibzon.org)
Open Source System Management Conference, May 10, 2012, Bolzano, Italy

Thanks for being here!

Cloud Computing

What is Cloud Computing?
- Virtualized systems, independent of hardware and leased to customers, in what is referred to as Infrastructure as a Service

Virtualization is the core of cloud computing
- Separates hardware from the operating system
- Efficient use of modern multi-core processors
- More servers with less hardware: system resources left unused by one server can be used by other types of servers with different resource usage
- Less energy, less rack space, more efficient use of resources

Benefits of Cloud Architecture

Virtualized systems in a cloud
- Can be managed entirely remotely
- Can move (even live) from one hardware host to another
- Can be shut down, saved to disk and started again when required
- Can be easily cloned, so that an identical system is started exactly when it is needed

A cloud lets you automate scaling up of infrastructure to handle peak traffic load, and scaling down afterwards to keep overall cost low.

This requires monitoring of all system resources!

Cloud Solutions and Vendors

Hypervisors (virtualization kernels)
- Commercial: VMware ESX, IBM z/VM, Microsoft Virtual PC
- Open-Source: Xen, KVM, OpenVZ, QEMU, VirtualBox
- Xen originally implemented paravirtualization, requiring the cloud OS to be Linux with a modified kernel.
- KVM and Xen-HVM do full virtualization with QEMU and CPU virtualization extensions (Intel's VT or AMD's SVM).
- OpenVZ is a hybrid of a paravirtualizer and user-mode Linux; it requires the host and the cloud OS to be the same version of Linux, sharing the kernel.

Virtualization and Cloud Software Suites
- Commercial: VMware vCloud, Microsoft Azure
- Open-Source: Eucalyptus, OpenStack, OpenNebula, Baracus
- Commercial based on Open-Source: Citrix XenServer, Oracle VM, Ubuntu Enterprise Cloud, Red Hat CloudForms, Parallels Virtuozzo

Cloud Infrastructure Providers
- Amazon EC2 (Xen), Rackspace (modified Xen), Linode (Xen), Savvis (VMware), and many more

Open-Source Cloud Software

Open-source hypervisors used in cloud systems
- Xen - http://www.xen.org/
- KVM - http://www.linux-kvm.org/
- OpenVZ - http://www.openvz.org/

Open-source cloud management software
- Eucalyptus - http://open.eucalyptus.com/
- OpenStack - http://www.openstack.org/
- OpenNebula - http://www.opennebula.org/
- Baracus - http://baracus-project.org/
- Proxmox - http://pve.proxmox.com/

Of the above, I recommend OpenStack with KVM or Eucalyptus with Xen. OpenVZ provides the best performance but no true isolation.

Monitoring for the Cloud

Monitoring of hardware (host OS) and hypervisor
- More static: hardware does not change as often
- Monitoring of system resources is often integrated into the virtualizer, and the information is not available to the cloud customer

Monitoring of virtual systems
- Dynamic: should be able to handle addition and removal of server instances
- Focus on application and network performance
- Ideally should monitor utilization and be able to launch new server instances (auto-scaling)

The monitoring system should itself be robust and handle more servers without impacting performance.

This means a cluster!

Cloud Monitoring Architecture

Horizontal Scaling
- Clouds can be as small as 10 servers and as large as 10,000+. When developing the architecture, you need to support its future growth.

Scaling on Demand
- A proactive system should handle big changes in the number of cloud instances. You may have 2 web server instances at 6am and grow to 20 at 10pm.

High Availability
- Good system design should be fully fault-tolerant: the application as a whole should continue to function without interruption if any one server instance dies.

Nagios Cluster Options

The base nagios-core package is for stand-alone monitoring, where a single server does all service checks. It can be extended to a Nagios cluster with:

Passive Service Checks (Classic Distributed Model)
- The old way: NSCA is used to forward results of checks from client servers to the main Nagios server. Not robust.

Shared Database (Central Dashboard Model)
- The NDO-Mod and Merlin projects implement this with a combination of NEB modules, a daemon and a database.

Worker Nodes (Load Balancing of Checks)
- DNX and Mod-Gearman do it with a combination of a loaded NEB module, a server daemon and client servers.

Passive Service Checks

[Diagram: client Nagios servers sending check results to the central Nagios server via NSCA]

How
- One central server has all services defined. It does not run any checks itself; all of them are listed as passive.
- Separate client Nagios servers run plugins and do checks for specific sets of hosts. Each has its own subset of the full Nagios config.
- Scripts are set up that capture results on each client server and send them to the central server using NSCA, which puts them into the Nagios command queue.

Advantages
- This works with any Nagios server; organizations have been doing it since at least 2002.

Disadvantages
- Requires a lot of custom scripting to organize the Nagios configs.
- Not reliable if the central server dies.
- Not robust enough to automate cloud instances being added and deleted.
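As a rough illustration of what such a forwarding script hands over, the sketch below formats one check result in the two common shapes: the tab-separated line that send_nsca reads on standard input, and the PROCESS_SERVICE_CHECK_RESULT external command that ends up in the Nagios command queue. The function names, the host name w1 and the service name HTTP are made up for the example.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nsca_line(host, service, code, output):
    """Format one result as a tab-separated line for send_nsca's stdin."""
    return f"{host}\t{service}\t{code}\t{output}\n"


def external_command(host, service, code, output, now=None):
    """Format the same result as a Nagios external command line."""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}\n"


# Hypothetical host and service names, for illustration only.
print(nsca_line("w1", "HTTP", OK, "HTTP OK - 0.012s response"), end="")
print(external_command("w1", "HTTP", CRITICAL, "Connection refused"), end="")
```

A client-side script would typically pipe the first form into send_nsca; the second form is what the central server processes from its command file.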

Shared Database

Who: NDO-DB and Merlin

How
- Multiple peer Nagios servers, each with a different config file specifying which services it checks.
- All servers use a common database to share results of checks and the status of the services they are monitoring.

Advantages
- There is no master Nagios server. There is a master DB server, but how to build a DB cluster is a much better understood topic.
- Using NEB avoids slow command-queue processing.

Disadvantages
- Partitioning of the monitoring infrastructure among servers is still a manual process.
- It is not easy to use this in a dynamic cloud environment; however, it works very well for fault tolerance.

DNX and Mod-Gearman Worker Nodes

How
- As with Passive Service Checks, there is a central Nagios server that does not execute any plugins.
- Unlike with passive checks, Nagios does schedule the checks. After that the NEB module takes over.
- The module passes information on which plugin(s) to run to the DNX server (or the Gearman server for Mod-Gearman), which manages the worker nodes.
- Worker nodes are separate servers, each running a special worker daemon. The daemon communicates with the management server and gets information (the plugin command) on what to run. It then passes results back to the management server, and the NEB module writes these results directly into Nagios memory.

Advantages and Disadvantages of DNX and Mod-Gearman

Advantages
- Robust: checks are automatically distributed among all cluster worker nodes (by default round-robin, on an equal basis).
- Scalable: fully achieves horizontal scaling of Nagios checks.
- Easy to use in a cloud environment:
  - An existing worker node can be replicated with no special config to start it. This lets you expand the cluster on demand.
  - All worker nodes are essentially the same, and no additional re-configuration is necessary to add a new node.
- Efficient integration with Nagios: using loaded NEB modules achieves low-level integration with Nagios, much better than NSCA and the command queue.

Disadvantages
- Still relies on a single central Nagios server. If the central Nagios server dies, the entire system is out.
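The round-robin distribution mentioned above can be pictured with a few lines of Python. This is only a sketch of the idea of spreading checks equally over identical workers; it is not how DNX or Mod-Gearman are implemented internally, and the check and node names are invented.

```python
from itertools import cycle


def distribute(checks, workers):
    """Assign each check to a worker node in round-robin order."""
    if not workers:
        raise ValueError("at least one worker node is required")
    assignment = {worker: [] for worker in workers}
    for check, worker in zip(checks, cycle(workers)):
        assignment[worker].append(check)
    return assignment


checks = ["w1/HTTP", "w1/LOAD", "w2/HTTP", "w2/LOAD", "w3/HTTP"]

# Two workers share the checks; adding a third needs no other change.
print(distribute(checks, ["node1", "node2"]))
print(distribute(checks, ["node1", "node2", "node3"]))
```

Because every worker is interchangeable, growing the cluster only means passing a longer worker list, which mirrors the "no additional re-configuration" point.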

DNX vs Mod-Gearman

DNX
- Single package, no external dependencies. Includes all job cluster control components.
- Hard to maintain and test for non-Linux environments.
- Can use localcheckpattern in the server configuration to direct jobs, but it is not documented.
- Supports nagios-2.x with a patch and nagios-3.x as is.
- Client can be extended with Nagios-specific features. Planned are: embedded Perl, check_icmp, check_snmp, check_nrpe.

Mod-Gearman
- Built around the Gearman project.
- Better maintained, since Gearman has many uses.
- Enjoys the benefits of wider testing on new releases.
- Easy to configure and to direct checks to separate queues depending on hostgroup and servicegroup.
- Only supports nagios 3.x.
- Supports event handlers and not just checks!
- Nagios-only features are hard to add at the node level.

Combining Checks Together

Combining data collection for multiple services is a great way to off-load Nagios. There are several approaches:

- Old way: cron jobs run plugins and submit results with NSCA.
- check_multi - http://www.my-plugin.de/check_multi
  This combines multiple checks into one; the output status is multiple lines, a new feature of Nagios 3.0.
- A separate collection daemon on each server. This is like the old way, but instead of cron a dedicated process on the server does the collection. Two popular open-source packages:
  - Munin: http://munin-monitoring.org/
  - Collectd: http://collectd.org (/wiki/index.php/collectd-nagios)

I recommend Collectd because it's faster. Munin relies on NSCA to push data to Nagios from each host, whereas Collectd can send data to the Nagios server using its own protocol, and Nagios can then check this data locally. For Munin you can do without NSCA by using http://code.google.com/p/nagios-munin/ (I haven't tried it myself, though).

In general, use plugins that support multiple thresholds and do checks together, instead of ones that have to be called separately for each check.
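To make the idea of one plugin call covering several thresholds concrete, here is a minimal Python sketch. It is not check_multi and does not reproduce its output format; the function names, metric names and thresholds are invented, and the severity ordering (CRITICAL over UNKNOWN over WARNING over OK) is an assumption chosen for the example.

```python
LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
# Assumed ranking of return codes from least to most severe.
SEVERITY = {0: 0, 1: 1, 3: 2, 2: 3}


def check_value(name, value, warn, crit):
    """Compare one metric against its warning and critical thresholds."""
    if value >= crit:
        code = 2
    elif value >= warn:
        code = 1
    else:
        code = 0
    return (name, code, f"{name}={value}")


def combine(results):
    """Fold several results into one return code and multi-line output."""
    worst = max((code for _, code, _ in results), key=SEVERITY.get, default=3)
    lines = [f"COMBINED {LABELS[worst]} - {len(results)} checks"]
    lines += [f"[{LABELS[code]}] {msg}" for _, code, msg in results]
    return worst, "\n".join(lines)


results = [
    check_value("load15", 2.5, warn=4, crit=8),
    check_value("tcp_conns", 950, warn=800, crit=1200),
    check_value("disk_pct", 97, warn=85, crit=95),
]
code, text = combine(results)
print(text)
```

Nagios then schedules a single service check per host instead of three, and the first output line carries the overall state.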

Combining All Options Together

All Nagios cluster options can be combined!
- DNX and Mod-Gearman offer horizontal scaling for all checks and relieve Nagios of the need to run them.
- Merlin or NDO-DB can be used for failover and scaling of the Nagios server itself.
- Munin or Collectd can run on cloud hosts to off-load the gathering of data for standard checks and provide that data to Nagios together.

[Diagram: Collectd running on each server]

Ideal Fully Fault-Tolerant Nagios Cluster Architecture

[Diagram components]
- Nagios Web Interface Server and Backup Nagios Web Interface Server
- Merlin/NDO DB and Merlin/NDO DB Backup
- DB Proxy and Standby DB Proxy
- Nagios Server and Backup Nagios Server
- Performance Data (RRD) Server (like NagiosGrapher) and Backup Performance Data (RRD) Server
- Four Worker Nodes
- Link labels: replication, heartbeat, crossmonitor, udpecho, udp

Ideally you would have each of the above as a separate cloud server, but even those with 1000s of servers may find this hard to maintain.

Nagios Cloud Cluster with 4 hosts

[Diagram components]
- MAIN NAGIOS SERVER: Apache, PNP w/ RRD, NPCD, MySQL DB, Merlin, Nagios Daemon, DNX Server
- STANDBY NAGIOS SERVER: Apache, PNP w/ RRD, NPCD, MySQL DB, Merlin, Nagios Daemon, DNX Server
- Two DNX Clients
- Link labels: replication (between the databases), crossmonitor (between the Nagios daemons)

Notes on the diagram
- The standby server has all checks disabled (except the cross-monitor of the other Nagios, which should not use DNX).
- If the main server dies, the backup takes over and registers itself in the dyndns server, replacing the primary.
- DNX clients use the dyndns address; they are restarted on a server switch.

Note: I'm working on a new Nagios add-on for failover: a NEB module that will take care of cross-monitoring, switching on failure, and syncing. This should be ready in late 2012.

Configuration of a cloud host

The best way to configure monitoring of cloud hosts with multiple instances is to have a template and define all services by hostgroups:

    define host {
        use             wprod-server                  ; template for all webservers
        host_name       w1
        alias           webserv1                      ; this is a second way to search
        address         w1.dynamic.cloud1.mydomain    ; local DNS
        hostgroups      production,loadbalanced,linux_centos5,webserv
        parents         loadbalancer1,loadbalancer2
        contact_groups  admins
    }

Instead of adding a new host of the same type when its instance starts up, I recommend adding a bunch of hosts in advance (up to the maximum you could auto-scale to for the day) and disabling their checks by default. Checks are then enabled using a Nagios command when the new server is launched and ready.
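One way to enable checks for a pre-defined host is to write external commands to the Nagios command file. The sketch below builds the ENABLE_HOST_CHECK and ENABLE_HOST_SVC_CHECKS commands for a host; the helper names are invented, and the command file path shown in the comment is only a common default that depends on the installation.

```python
import time


def enable_check_commands(host, now=None):
    """Build external commands that enable host and service checks."""
    ts = int(time.time() if now is None else now)
    return [
        f"[{ts}] ENABLE_HOST_CHECK;{host}\n",
        f"[{ts}] ENABLE_HOST_SVC_CHECKS;{host}\n",
    ]


def submit(commands, command_file):
    """Write commands to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as fh:
        fh.writelines(commands)


for line in enable_check_commands("w1"):
    print(line, end="")

# On a real server the path is installation-specific, for example:
# submit(enable_check_commands("w1"), "/usr/local/nagios/var/rw/nagios.cmd")
```

The matching DISABLE_HOST_CHECK and DISABLE_HOST_SVC_CHECKS commands can be sent the same way when an instance is shut down.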

Auto-Scaling

- Event handlers or a custom check can be used.
- Trigger based on the total number of open HTTP sockets (check_netstat, check_apache_status) across all servers.
- Write a custom script that keeps the number of currently active servers in a DB or local file and uses it to set the name of the new server.
- Pass the new server name as a parameter when launching the cloud instance. Write startup scripts that use it to set the hostname and register the IP in a local dynamic DNS server.
- For Amazon EC2, the aws utility is very useful for automating the launch of new servers. Get it at http://timkay.com/aws/
- An extra Nagios worker node is launched similarly, triggered when enough servers have been launched. This can also be done based on Nagios stats (check_nagios).
- Scale down after an hour or more of low resource usage. You can do it with a check that relies on RRD data.
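The scale-down rule can be sketched as a small function over recent samples. This is only an illustration: it assumes the samples have already been read (for example from RRD data) into a list of (unix_time, value) pairs, and the function name, threshold and sampling interval are invented.

```python
def should_scale_down(samples, now, threshold, window=3600):
    """Return True if usage stayed below threshold for the whole window.

    samples: list of (unix_time, value) pairs, e.g. average connections
    per server. The data must reach back to the start of the window, so
    a freshly started collector does not trigger a scale-down.
    """
    if not samples or min(t for t, _ in samples) > now - window:
        return False
    recent = [v for t, v in samples if now - window <= t <= now]
    if not recent:
        return False
    return all(v < threshold for v in recent)


now = 1336600000
# One sample every 5 minutes over the last hour, all well below threshold.
quiet_hour = [(now - 3600 + i * 300, 10) for i in range(13)]
print(should_scale_down(quiet_hour, now, threshold=50))
```

Requiring every sample in the window to be low, rather than the average, keeps a single traffic burst from being hidden by quiet periods around it.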

Use of SQL DB for Auto-Scaling

    CREATE TABLE ServerData (
        id bigint(10) unsigned NOT NULL,
        name varchar(50) default NULL,
        connections bigint(20) unsigned default 0,
        started_on datetime default NULL,
        PRIMARY KEY(id));

After you get the results of a server check (for example in an event handler), you do this:

    UPDATE ServerData SET connections=<data from nagios check> WHERE name=<server host>

Custom check to see if a new server should be started (wait at least 600 seconds after the last launch before launching another):

    $count=sqlexec("select COUNT(id) FROM ServerData")
    $sumit=sqlexec("select SUM(connections) FROM ServerData")
    $lastlaunched=sqlexec("select MAX(started_on) FROM ServerData")
    if $sumit/$count > $threshold && ($now-$lastlaunched) > 600 {
        <figure out the name and id>
        launch_new_server_instance($newname)
        sqlexec("INSERT INTO ServerData VALUES ($newid, $newname, 0, now())")
        enable_nagios_service_checks($newname)
    }

This is for illustration of logic only. Not real code.
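For readers who want something they can run, the same decision logic can be sketched in Python with the standard sqlite3 module. Several things here are assumptions made for the example rather than part of the slides: SQLite instead of a server database, launch times stored as Unix timestamps, the 600 seconds treated as a minimum gap between launches, and the helper names should_launch and register_server. Launching the instance and enabling Nagios checks are left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS ServerData (
    id INTEGER PRIMARY KEY,
    name TEXT,
    connections INTEGER DEFAULT 0,
    started_on INTEGER
)"""


def should_launch(db, now, threshold, cooldown=600):
    """True if average connections per server exceed the threshold
    and the most recent launch is older than the cooldown."""
    count, total, last = db.execute(
        "SELECT COUNT(id), SUM(connections), MAX(started_on) FROM ServerData"
    ).fetchone()
    if not count:
        return False
    overloaded = (total or 0) / count > threshold
    cooled_down = last is None or now - last > cooldown
    return overloaded and cooled_down


def register_server(db, name, now):
    """Record a newly launched server and return its row id."""
    cur = db.execute(
        "INSERT INTO ServerData (name, connections, started_on) VALUES (?, 0, ?)",
        (name, now),
    )
    db.commit()
    return cur.lastrowid


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
register_server(db, "w1", now=0)
register_server(db, "w2", now=100)
# Results from Nagios checks would normally update one server at a time.
db.execute("UPDATE ServerData SET connections = 500")
print(should_launch(db, now=1000, threshold=300))
```

The cooldown gives a newly launched server time to start taking traffic before the average is judged again, which avoids launching several instances for one spike.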

Additional Cloud Monitoring Tips

- Cloud servers are not entirely independent: other servers on the same hardware host may affect yours.
- System load checks may not be as useful with some virtualizers and can show false spikes in load. Put more emphasis on the 15-minute load.
- To determine whether a system is too loaded and you need to launch another server, use the number of TCP connections and the time to process each request.
- If you control the cloud, find a way to get information about the hardware hosts. Create a check that shows the physical server name and gives a link to the hardware host monitoring graphs.
- Remember, you can just launch a new server. Do not spend too much time investigating the cause: take the server out of production first, replace it, and investigate later.

Nagios Cluster Software

- Nagios, NDO-Utils, NSCA - http://www.nagios.org/
- DNX (Distributed Nagios eXecutor) - http://dnx.sourceforge.net/
- Mod-Gearman - http://labs.consol.de/lang/de/nagios/mod-gearman/
- Gearman - http://gearman.org/
- Merlin (Module for Effortless Redundancy and Loadbalancing In Nagios, by OP5) - http://www.op5.org/community/plugin-inventory/op5-projects/merlin
- check_multi (collect data from multiple checks) - http://www.my-plugin.de/check_multi/
- Ganglia (open-source computing cluster monitoring, can be integrated with Nagios) - http://www.ganglia.info

Questions?

More Questions? Feedback? William Leibzon <william@leibzon.org> My Nagios Page (mostly plugins) : http://william.leibzon.org/nagios/