Optimizing your Monitoring and Trending tools for the Cloud



Similar documents
the missing log collector Treasure Data, Inc. Muga Nishizawa

Bernd Ahlers Michael Friedrich. Log Monitoring Simplified Get the best out of Graylog2 & Icinga 2

Migrating a running service to AWS

Ganglia & Nagios. Maciej Lasyk 11. Sesja Linuksowa Wrocław, /25. Maciej Lasyk, Ganglia & Nagios

Clusters in the Cloud

TECHNOLOGY WHITE PAPER Jun 2012

Case Study : 3 different hadoop cluster deployments

Modern Web development and operations practices. Grig Gheorghiu VP Tech Operations Nasty Gal

VMware vcenter Operations Manager Administration Guide

There are numerous ways to access monitors:

Big data blue print for cloud architecture

Scaling Pinterest. Yash Nelapati Ascii Artist. Pinterest Engineering. Saturday, August 31, 13

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida

Assignment # 1 (Cloud Computing Security)

Open Source Monitoring

Amazon Elastic Beanstalk

Big Data Operations Guide for Cloudera Manager v5.x Hadoop

Nagios and Cloud Computing

TECHNOLOGY WHITE PAPER Jan 2016

Wait, How Many Metrics? Monitoring at Quantcast

Scalable Architecture on Amazon AWS Cloud

Frequently Asked Questions Plus What s New for CA Application Performance Management 9.7

SIG-NOC Meeting - Stuttgart 04/08/2015 Icinga - Open Source Monitoring

FITB. Network Graphing Done Right. Laurie Denness

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Oracle Big Data SQL Technical Update

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Icinga and Puppet Dominik Schulz Head of Datacenter and Operations Magic Internet / MyVideo

Practical Hadoop. Security. Bhushan Lakhe

Network Monitoring. Review of Software

IBM Software InfoSphere Guardium. Planning a data security and auditing deployment for Hadoop

Savanna Hadoop on. OpenStack. Savanna Technical Lead

Cloud Computing. Lecture 24 Cloud Platform Comparison

Sun Cloud API: A RESTful Open API for Cloud Computing

GroundWork Monitor Open Source Readme

VMware vcenter Operations Manager Enterprise Administration Guide

WEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE

Cloud Computing with Amazon Web Services and the DevOps Methodology.

Cloud computing - Architecting in the cloud

Deploying Hadoop with Manager

Cost-Effective Business Intelligence with Red Hat and Open Source

Kafka & Redis for Big Data Solutions

CA Spectrum and CA Service Desk

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Monitoring Remedy with BMC Solutions

APP DEVELOPMENT ON THE CLOUD MADE EASY WITH PAAS

Building Success on Acquia Cloud:

Application Security Best Practices. Matt Tavis Principal Solutions Architect

Architecting ColdFusion For Scalability And High Availability. Ryan Stewart Platform Evangelist

CloudCenter Full Lifecycle Management. An application-defined approach to deploying and managing applications in any datacenter or cloud environment

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

CAREN NOC MONITORING AND SECURITY

AV Management Dashboard

Cloud Hosting. QCLUG presentation - Aaron Johnson. Amazon AWS Heroku OpenShift

Migration Scenario: Migrating Batch Processes to the AWS Cloud

ICINGA2 OPEN SOURCE MONITORING

Hadoop Development & BI- 0 to 100

How Comcast Built An Open Source Content Delivery Network National Engineering & Technical Operations

Lustre Monitoring with OpenTSDB

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

PTC System Monitor Solution Training

Building a Scalable News Feed Web Service in Clojure

How To Monitor A Server With Zabbix

Monitoring MySQL. Presented by, MySQL & O Reilly Media, Inc. A quick overview of available tools

Yahoo! Communities Architectures Ian Flint

Accelerating Wordpress for Pagerank and Profit

Availability Management Nagios overview. TEIN2 training Bangkok September 2005

An idea on how to monitor growth of database tables in an system using Bischeck

Drive new Revenue With PaaS/IaaS. Ruslan Synytsky CTO, Jelastic

Optimization of QoS for Cloud-Based Services through Elasticity and Network Awareness

Certified Big Data and Apache Hadoop Developer VS-1221

Servers. Servers. NAT Public Subnet: /20. Internet Gateway. VPC Gateway VPC: /16

GRNET NOC network monitoring & visualization tools

Magento Search Extension TECHNICAL DOCUMENTATION

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

LIGHTHOUSE. Case Study: Impact Telecom Early Warning System

Cloud Computing. Up until now

GigaSpaces Real-Time Analytics for Big Data

Peers Techno log ies Pv t. L td. HADOOP

F Cross-system event-driven scheduling. F Central console for managing your enterprise. F Automation for UNIX, Linux, and Windows servers

SANS Dshield Webhoneypot Project. OWASP November 13th, The OWASP Foundation Jason Lam

Hosting Drupal on Amazon Web Services (AWS) Heather Wozniak, Ph.D. Web Developer, UW College of Arts & Sciences hwozniak@uw.edu

Big Data for everyone Democratizing big data with the cloud. Steffen Krause Technical

Enabling Database-as-a-Service (DBaaS) within Enterprises or Cloud Offerings

Changing the Equation on Big Data Spending

Improve performance and availability of Banking Portal with HADOOP

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience

Logentries Insights: The State of Log Management & Analytics for AWS

HP LeftHand SAN Solutions

Keep an eye on your PostgreSQL clusters

Big Data with Component Based Software

Gabriel Iuga. London, United Kingdom Tel: ; Website:

Transcription:

Optimizing your Monitoring and Trending tools for the Cloud Nagios World Conference 2012 Nicolas Brousse Lead Operations Engineer September 28 th 2012

About TubeMogul What are some of our challenges? Our environment Amazon Cloud Environment Automated Monitoring Efficient on-call rotation Efficient monitoring What s next? Q&A

About TubeMogul Founded in 2006 Formerly a video distribution and analytics platform TubeMogul is a Brand-Focused Video Marketing Company Build for Branding Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com

What are some of our challenges? Monitoring between 700 to 1000 servers Servers spread across 6 different locations 4 Amazon EC2 Regions (our public cloud provider) 1 Hosted (Liquidweb) & 1 VPS (Linode) Little monitoring resources Collecting over 115,000 metrics Monitoring over 20,000 services with Nagios Multiple billions of HTTP requests a day Most of it must be served in less than 100ms Lost of traffic could mean lost of business opportunity Or worst, over-spending

Our environment Over 80 different server profiles Our stack: Java (Embedded Jetty, Tomcat) PHP, RoR Hadoop: HDFS, M/R, Hbase, Hive Couchbase MySQL Monitoring: Nagios, NSCA Graphing: Ganglia, sflow, Graphite Configuration Management: Puppet

Amazon Cloud Environment

Amazon Cloud Environment We use EC2, SDB, SQS, EMR, S3, etc. We don t use ELB We heavily use EC2 Tags ec2-describe-instances -F tag:hostname=dev-build01

Automated Monitoring

Automated Monitoring Configuring Ganglia using Puppet templates globals { override_hostname = <%= scope.lookupvar('hostname') %> } udp_send_channel { host = <%= scope.lookupvar('ec2_tag_nagios_host') %> port = 8649 ttl = 1 } sflow { udp_port = 6343 accept_jvm_metrics = yes multiple_jvm_instances = yes }

Automated Monitoring Or configuring Host sflow using Puppet templates sflow{ DNSSD = off polling = 20 sampling = 512 collector{ ip = <%= ec2_tag_nagios_host %> udpport = 6343 } }

Automated Monitoring Puppet configure our monitoring instances We use Nagios regex : use_regexp_matching=1 But we don t use true regex : use_true_regexp_matching=0 We use NSCA with Upstart We don t use the perfdata We use pre-cached objects We includes our configurations from 3 directories objects => templates, contacts, commands, event_handlers servers => contain a configuration file for each server clusters => contain a configuration file for each cluster

Automated Monitoring Process of event when starting a new host and add it to our monitoring: 1. We start a new instance using Cerveza and Cloud-init 2. Puppet configure Gmond or Host sflow on the instance 3. Our monitoring server running Gmond and Gmetad get data from the new instance 4. A Nagios check run every minute and check for new hosts 5. Look for new hosts using EC2 API Look for EC2 tag hostname to confirm it s a legit host, not a zombie / fail start Look for EC2 tag nagios_host to see if the host belong to this monitoring instance If a new host is found: We build a config for the host based on a template file and doing some string replace Once all config have been generated, we rebuild pre-cache objects and reload Nagios 6. If we find Zombie host, we generate a Warning alert 7. If the config is corrupt, we send a Critical alert

Automated Monitoring

Automated Monitoring

Efficient on-call rotation Follow the sun Some of our team is in Ukraine, no more Tier 1 night on-call for us Nagios timeperiod and escalation are a pain to maintain Nagios notification plugged to Google Calendar Using our own notification script for email and paging Google Clendar make it easy for each team to manage their own on-call calendar Support for multiple Tier and complex schedules Caching Google Calendar info locally every hour Simpler definitions and rules in Nagios contacts Notify only people on-call, unless they asked for off call emails

Efficient on-call rotation Using Google Calendar

Efficient on-call rotation Simple contact definitions Google Calendar info Tier Filter (Regex) Tier Interval (time to wait before escalating alert since last tier) Off call email

Efficient on-call rotation

Efficient on-call rotation On-call contact fetched from Google Calendar at the bottom of the alert makes our life easier!

Efficient monitoring We disable most notification and only care of a cluster status

Efficient monitoring Most of our checks are based on Ganglia RRD files

Efficient monitoring It become really easy to monitor any metrics returned by Ganglia

Efficient monitoring We can check cluster status by hosts/services but also per returned messages!

Efficient monitoring

Efficient monitoring Using Graphite Federated Storage One place to see all our metrics from all the world No delay due to rsync of RRD files Graph close to real time, delay only due to rrdcached flushing interval

What s next? Some hot topics Do trending alert with Nagios based on Graphite/Ganglia data Better automation for non-cloud servers Ensure we can scale our monitoring when using hybrid cloud (Eucalyptus) or multiple public cloud provider Get a better centralized view of our different Nagios

Ops @ TubeMogul All this wouldn t be possible without a strong System Operation team Andrey Shestakov Eamon Bisson Donahue Justin Francisconi Marylene Tanfin Nicolas Brousse Stan Rudenko

Thank you TubeMogul is Hiring! http://www.tubemogul.com/jobs jobs@tubemogul.com Follow us on Twitter @TubeMogul @orieg