Monitoring and Alerting




All the things I've tried that didn't work, plus a few others.
By Aaron S. Joyner, Senior System Administrator, Google, Inc.

Blackbox vs Whitebox
  Blackbox:
    Requires no participation of the monitored system
    Observes external functionality, "what the user sees"
    End to end test
  Whitebox:
    Collects data intentionally provided by the target system
    Has more granular information about the system
    Can provide warning of problems before they occur
    Helps with capacity planning and with understanding why things went (or are about to go) really sideways

Blackbox tests measure...
  Can you ping the webserver?
  Can you fetch a webpage from the webserver?
  Does it have the correct contents?
Whitebox monitoring collects...
  Ethernet interface statistics (packets/bytes sent/received)
  System load (cpu, memory, uptime, disk usage)
  Daemon statistics (qps, uptime, cpu usage, version)
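The "can you fetch a webpage, and does it have the correct contents?" check can be sketched in a few lines. This is a minimal illustration, not the speaker's tooling: the names ProbeResult, evaluate, and probe are hypothetical, and the URL and expected text are placeholders you would supply.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    reason: str
    latency_s: float


def evaluate(status: int, body: str, expected_text: str, latency_s: float) -> ProbeResult:
    """Judge a fetched page the way a user would: did it load, and is it the right page?"""
    if status != 200:
        return ProbeResult(False, f"unexpected HTTP status {status}", latency_s)
    if expected_text not in body:
        return ProbeResult(False, "page loaded but expected contents are missing", latency_s)
    return ProbeResult(True, "ok", latency_s)


def probe(url: str, expected_text: str, timeout_s: float = 5.0) -> ProbeResult:
    """Fetch url from outside the server and time the whole request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return ProbeResult(False, f"fetch failed: {exc}", time.monotonic() - start)
    return evaluate(status, body, expected_text, time.monotonic() - start)
```

Keeping evaluate separate from the network fetch makes the pass/fail logic easy to test without a live webserver. A caller would run something like probe("http://www.example.com/", "Example Domain") on a schedule and record both the verdict and the latency.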

Historical vs Realtime Data
  Historical Data
    "How many QPS did we see yesterday, before the DoS?"
    Useful for tracking traffic growth, finding trends, forensics
    Critical for evaluating your efforts long term (SLA/SLO)
  Realtime Data
    Useful for evaluating current health
    Primarily drives alerting and troubleshooting
    Can make for great consoles

Monitoring vs Alerting vs Notification
  Monitoring
    Collects the underlying data
    Organizes and displays data
  Alerting
    Defines the thresholds for "questionable" and "bad" performance
    Expresses dependencies, suppresses duplicate alerts
  Notification
    Something's wrong... who gets the page?
    Escalation: What if they don't respond?
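To make the alerting and notification layers concrete, here is a small sketch of thresholds, dependency suppression, duplicate suppression, and escalation. Everything in it is an assumption made for illustration: the Rule fields, the function names, and the fixed escalation interval are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    metric: str
    warn_above: float        # threshold for "questionable"
    crit_above: float        # threshold for "bad"
    depends_on: tuple = ()   # rules whose "bad" state already explains this one


def classify(value, rule):
    """Map a raw metric value onto ok / questionable / bad."""
    if value > rule.crit_above:
        return "bad"
    if value > rule.warn_above:
        return "questionable"
    return "ok"


def evaluate_rules(rules, metrics, already_firing):
    """Return (states, new_alerts) for one evaluation pass.

    already_firing maps rule name -> state that was alerted on previously.
    """
    states = {rule.name: classify(metrics[rule.metric], rule) for rule in rules}
    new_alerts = []
    for rule in rules:
        state = states[rule.name]
        if state == "ok":
            continue
        if any(states.get(dep) == "bad" for dep in rule.depends_on):
            continue  # a dependency is down, so this alert is only a symptom
        if already_firing.get(rule.name) == state:
            continue  # duplicate: this state was already alerted on
        new_alerts.append((rule.name, state))
    return states, new_alerts


def who_to_page(chain, minutes_unacknowledged, step_minutes=15):
    """Escalation: move one step up the chain for every step_minutes of silence."""
    index = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[index]
```

With a rule "web_latency" that depends on "switch_down", a dead switch produces one alert rather than two, and a page that goes unanswered for 20 minutes moves from the first to the second person in the chain.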

Case Studies (with some caveats)

Host Monitoring
  Blackbox
    Reachability
    Remote Login (ssh, rdp)
  Whitebox
    Uptime
    Load
    CPU Usage
    Memory
    Disk IO (per disk)
    Disk Errors (SMART)
    Network Interface stats
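A whitebox collector runs on the host itself and reports what the system knows about its own state. The sketch below covers only two items from the list (load and disk usage) using the Python standard library; the function name and the dictionary keys are hypothetical, and a real collector would gather the remaining items from OS-specific sources.

```python
import os
import shutil
import time


def collect_host_stats(path="/"):
    """Gather a small whitebox sample: disk usage for path plus load averages."""
    usage = shutil.disk_usage(path)
    stats = {
        "timestamp": time.time(),
        "disk_total_bytes": usage.total,
        "disk_used_fraction": usage.used / usage.total if usage.total else 0.0,
    }
    # os.getloadavg() exists only on Unix-like systems and may be unavailable.
    if hasattr(os, "getloadavg"):
        try:
            load_1m, load_5m, load_15m = os.getloadavg()
            stats.update(load_1m=load_1m, load_5m=load_5m, load_15m=load_15m)
        except OSError:
            pass
    return stats
```

Each sample carries a timestamp so that the same data can feed both realtime consoles and the historical record.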

CGI Webapp (apache+php+mysql)
  Blackbox
    Apache delivers a static page
    "Self-contained" PHP page
    PHP page w/ "simple" DB fetch
    PHP page w/ complicated DB fetch
    Latency of each of the above
  Whitebox
    Apache: scrape the logs, scrape mod_status
    PHP: version, modules
    MySQL: 'show status', 'show table status', replication status, replication delay

Database (MySQL / Postgres)
  Blackbox
    Can you execute a noop query? (select 1;)
    Can you execute a complicated query?
    How long does each query take?
  Whitebox
    'show status'
    'show table status'
    'show master status'
    'show slave status'
    replication status
    replication delay
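The noop-query check translates directly into code against any Python DB-API 2.0 connection. In this sketch the standard library's sqlite3 stands in for a MySQL or Postgres driver so the example runs anywhere; the function names, the result dictionary, and the one-second limit are assumptions.

```python
import sqlite3
import time


def timed_query(conn, sql):
    """Run sql on a DB-API connection and return (rows, seconds elapsed)."""
    start = time.monotonic()
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return rows, time.monotonic() - start


def check_database(conn, max_seconds=1.0):
    """Blackbox check: can we run a noop query, and how long does it take?"""
    try:
        rows, elapsed = timed_query(conn, "SELECT 1")
    except Exception as exc:  # driver-specific errors vary, so catch broadly
        return {"ok": False, "reason": f"noop query failed: {exc}", "seconds": None}
    if len(rows) != 1 or rows[0][0] != 1:
        return {"ok": False, "reason": f"unexpected result {rows!r}", "seconds": elapsed}
    if elapsed > max_seconds:
        return {"ok": False, "reason": "noop query too slow", "seconds": elapsed}
    return {"ok": True, "reason": "ok", "seconds": elapsed}
```

Usage: check_database(sqlite3.connect(":memory:")). The "complicated query" check can reuse timed_query with a representative query from the application and its own latency limit.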

Mail System
  Blackbox
    Test message delivery
    Test message latency
    Spam score of test msg
  Whitebox
    Messages per second
    Num messages per queue
    CPU usage of spam scoring

Distributed indexed key/value pair storage system
  Stores X billion key/value pairs
  One frontend "index" server
  15 backend "value" servers
  Index server looks up value from value server, returns to user
  Now let's design the monitoring...

Distributed indexed key/value pair storage system
  Blackbox: What can you think of?
  Whitebox: What can you think of?

Distributed indexed key/value pair storage system
  Blackbox
    Store a new key
    Check key existence
    Retrieve value by key
    Request latency for each
    (These are effectively the SLOs for your SLA.)
  Whitebox
    Index server uptime
    Value server uptime
    Index server qps
    Value server qps / instance
    Index server ram usage
    Value server disk usage
    Number of hash collisions
    Rate of collision resolution
    Replication qps
    Values per replication level
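The four blackbox checks for the key/value system can be sketched as one end-to-end probe. The talk does not describe the store's client API, so the put/exists/get interface, the InMemoryStore stand-in, and the 0.5 second latency target below are all hypothetical.

```python
import time
import uuid


class InMemoryStore:
    """Stand-in client for illustration only; the real system's API is not known."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def exists(self, key):
        return key in self._data

    def get(self, key):
        return self._data[key]


def _timed(operation):
    """Call operation() and return (result, seconds elapsed)."""
    start = time.monotonic()
    result = operation()
    return result, time.monotonic() - start


def probe_store(store, latency_slo_s=0.5):
    """Store a new key, check that it exists, read it back, and time each request."""
    key = f"probe-{uuid.uuid4()}"
    value = "canary"
    latency = {}
    try:
        _, latency["store"] = _timed(lambda: store.put(key, value))
        found, latency["exists"] = _timed(lambda: store.exists(key))
        fetched, latency["retrieve"] = _timed(lambda: store.get(key))
    except Exception as exc:
        return {"ok": False, "reason": f"request failed: {exc}", "latency_s": latency}
    if not found:
        return {"ok": False, "reason": "stored key not found", "latency_s": latency}
    if fetched != value:
        return {"ok": False, "reason": "retrieved value does not match", "latency_s": latency}
    slow = [name for name, seconds in latency.items() if seconds > latency_slo_s]
    if slow:
        return {"ok": False, "reason": "over latency SLO: " + ", ".join(slow), "latency_s": latency}
    return {"ok": True, "reason": "ok", "latency_s": latency}
```

Because the probe exercises the same path a user would, a failure here maps directly onto an SLO violation. A production version would also clean up or expire its probe keys, which this sketch leaves out.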

Alert Handling
  How to get some sleep but keep everything running.

The usual way
  Page when a service, host, or network is down
  Send an email alert for everything else
  Limited console, usually "alerts firing" or "problems"

Basic Suggestions
  No alerts via email! (It doesn't scale.)
  No alerts if they are not human actionable.
  Pages for things that will break the SLA
  Tickets/Bugs for things that need a human to look at them, but can wait until morning.
  Consoles for displaying the first two categories, and graphs or list displays of anything else that might be relevant
  Auto-assign the tickets/bugs to humans

Pages vs Tickets
  Page:
    Webserver not responding
    Page latency is > SLA
    Disk errors on the DB
  Ticket:
    Disk is over 90% capacity
    Test mail delivery took 5m
    High load on a machine
    Test email undelivered >30m
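The routing policy from the suggestions above reduces to two questions per alert: can a human act on it, and will it break the SLA? The sketch below encodes that decision. The Alert fields and function names are hypothetical, and round-robin assignment is only one possible way to auto-assign tickets.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    human_actionable: bool
    breaks_sla: bool


def route(alert):
    """Decide where an alert goes: 'page', 'ticket', or 'console'."""
    if not alert.human_actionable:
        return "console"   # nothing for a person to do, so show it but notify nobody
    if alert.breaks_sla:
        return "page"      # wake someone up
    return "ticket"        # needs a human, but can wait until morning


def assign_ticket(ticket_number, rotation):
    """Auto-assign tickets round-robin (an assumed policy) across the rotation."""
    return rotation[ticket_number % len(rotation)]
```

For example, "Webserver not responding" is actionable and breaks the SLA, so it pages; "Disk is over 90% capacity" is actionable but can wait, so it becomes a ticket with an owner.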

The Playbook
  So you got a page... now what?

Recipe for a good Playbook
  Problem description(s)
  If necessary: how to narrow down the problem
  Severity
  Suggested resolution
  If you add an alert, you're responsible for adding an entry for it.
  If you work an alert from the playbook that isn't right, fix it.
  If you have playbook entries with more than one resolution, you probably need more focused monitoring and corresponding alerts.
  Remember, playbook customers are sleepy.
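The recipe lends itself to a simple structured record plus a check that can run whenever alerts change. This is a sketch under assumptions: the PlaybookEntry fields mirror the four ingredients listed above, but the names and the lint_playbook function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookEntry:
    alert: str                  # name of the alert this entry documents
    description: str            # problem description
    severity: str
    resolutions: list           # suggested resolution(s), ideally exactly one
    narrowing_steps: list = field(default_factory=list)  # optional: how to narrow it down


def lint_playbook(alert_names, entries):
    """Return a list of problems: undocumented alerts and unfocused entries."""
    problems = []
    documented = {entry.alert for entry in entries}
    for name in sorted(set(alert_names) - documented):
        problems.append(f"{name}: alert has no playbook entry")
    for entry in entries:
        if not entry.resolutions:
            problems.append(f"{entry.alert}: no suggested resolution")
        elif len(entry.resolutions) > 1:
            problems.append(
                f"{entry.alert}: {len(entry.resolutions)} resolutions; "
                "consider more focused monitoring and alerts"
            )
    return problems
```

Running the lint as part of adding an alert enforces the rule that whoever adds the alert also adds its entry, and it flags entries that leave a sleepy reader with a choice to make.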

Questions?