Scaling Graphite Installations



Similar documents
SCF/FEF Evaluation of Nagios and Zabbix Monitoring Systems. Ed Simmonds and Jason Harrington 7/20/2009

High Availability Solutions for the MariaDB and MySQL Database

Monitoring and Alerting

Wait, How Many Metrics? Monitoring at Quantcast

America s Most Wanted a metric to detect persistently faulty machines in Hadoop

Tableau Server Scalability Explained

Exploring Amazon EC2 for Scale-out Applications

Adding scalability to legacy PHP web applications. Overview. Mario Valdez-Ramirez

Maintaining Non-Stop Services with Multi Layer Monitoring

2000 databases later. Kristian Köhntopp Mittwoch, 17. April 13

Ignify ecommerce. Item Requirements Notes

Preparing a SQL Server for EmpowerID installation

Tuning Tableau Server for High Performance

Best Practices for Monitoring Databases on VMware. Dean Richards Senior DBA, Confio Software

HP SN1000E 16 Gb Fibre Channel HBA Evaluation

Big Fast Data Hadoop acceleration with Flash. June 2013

Lustre Monitoring with OpenTSDB

Hardware/Software Guidelines

Tableau Server 7.0 scalability

Monitoring Cloud Applications. Amit Pathak

Bernd Ahlers Michael Friedrich. Log Monitoring Simplified Get the best out of Graylog2 & Icinga 2

White Paper. Educational. Measuring Storage Performance

SharePoint Data Management and Scalability on Microsoft SQL Server

Analysis of VDI Storage Performance During Bootstorm

Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55%

An Oracle White Paper July Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide

XenDesktop 7 Database Sizing

Cloud Based Application Architectures using Smart Computing

High Availability of the Polarion Server

Analyzing large flow data sets using. visualization tools. modern open-source data search and. FloCon Max Putas

Geospatial Server Performance Colin Bertram UK User Group Meeting 23-Sep-2014

K2 [blackpearl] deployment planning

Scalable Architecture on Amazon AWS Cloud

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Benchmarking Hadoop & HBase on Violin

MakeMyTrip CUSTOMER SUCCESS STORY

Performance characterization report for Microsoft Hyper-V R2 on HP StorageWorks P4500 SAN storage

Hardware Performance Optimization and Tuning. Presenter: Tom Arakelian Assistant: Guy Ingalls

The VMware Reference Architecture for Stateless Virtual Desktops with VMware View 4.5

FileNet System Manager Dashboard Help

Dave Stokes MySQL Community Manager

GoGrid Implement.com Configuring a SQL Server 2012 AlwaysOn Cluster

Graylog2 Lennart Koopmann, OSDC /

Deploying and Optimizing SQL Server for Virtual Machines

Scaling Pinterest. Yash Nelapati Ascii Artist. Pinterest Engineering. Saturday, August 31, 13

Stratusphere Solutions

WHITE PAPER Redefining Monitoring for Today s Modern IT Infrastructures

MySQL: Cloud vs Bare Metal, Performance and Reliability

Preparing an IIS Server for EmpowerID installation

Development of Monitoring and Analysis Tools for the Huawei Cloud Storage

Seradex White Paper. Focus on these points for optimizing the performance of a Seradex ERP SQL database:

GeoCloud Project Report USGS/EROS Spatial Data Warehouse Project

Maximizing SQL Server Virtualization Performance

Enhancing SQL Server Performance

Windows Server Performance Monitoring

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

MALAYSIAN PUBLIC SECTOR OPEN SOURCE SOFTWARE (OSS) PROGRAMME. COMPARISON REPORT ON NETWORK MONITORING SYSTEMS (Nagios and Zabbix)

Best Practices Guide Revision B. McAfee epolicy Orchestrator Software

Tuning WebSphere Application Server ND 7.0. Royal Cyber Inc.

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

ROCANA WHITEPAPER How to Investigate an Infrastructure Performance Problem

End Your Data Center Logging Chaos with VMware vcenter Log Insight

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

Availability Guide for Deploying SQL Server on VMware vsphere. August 2009

Microsoft SQL Server performance tuning for Microsoft Dynamics NAV

Contents Introduction... 5 Deployment Considerations... 9 Deployment Architectures... 11

SQL Server Solutions GETTING STARTED WITH. SQL Safe Backup

High Availability Solutions for MySQL. Lenz Grimmer DrupalCon 2008, Szeged, Hungary

Using New Relic to Monitor Your Servers

Zadara Storage Cloud A

Antelope Enterprise. Electronic Documents Management System and Workflow Engine

XDB. Shared MySQL hosting at Facebook scale. Evan Elias Percona Live MySQL Conference, April 2015

StorPool Distributed Storage Software Technical Overview

PARALLELS CLOUD STORAGE

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010

SolarWinds. Packet Analysis Sensor Deployment Guide

Scaling Database Performance in Azure

Mike: Alright welcome to episode three of Server Talk, I m here with Alexey. I m Mike. Alexey, how are things been going, man?

McAfee Enterprise Mobility Management Performance and Scalability Guide

Tools and strategies to monitor the ATLAS online computing farm

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

PANDORA FMS NETWORK DEVICE MONITORING

HP OpenView Service Desk Process Insight 2.10 software

Probably the only database Citrix people cared about

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

Capacity planning for IBM Power Systems using LPAR2RRD.

Best Practices for Virtualised SharePoint

1. Server Microsoft FEP Instalation

Monitoring Galera with MONyog

ENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE

Juniper Secure Analytics Release Notes

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle

Contents. SnapComms Data Protection Recommendations

Transcription:

Scaling Graphite Installations

Graphite basics Graphite is a web based Graphing program for time series data series plots. Written in Python Consists of multiple separate daemons Has it's own storage backend Like RRD, but with more features

Moving parts Whisper/Ceres The storage backend Webapp Web frontend, and API provider Relaying daemons Event based daemons Matches input based on name Relays to one or more destinations based on rules or hashing

Original production setup A small cluster We were planning to grow slowly RAID 1+0 spinning disk setup It works for our databases Ran into the IO wall Spinning rust sucks at IO Whisper updates force crazy seek patterns

Scaling problems We started with hosts in a /24 feeding one box. We ran into IO issues when we added the second /24. On the second day

Sharding Added more backends Manual rules to split traffic coming to the Graphite setup to storage nodes This becomes hard to maintain and balance

Speeding up IO Move to 400 GB SSDs from HP in RAID 1. We got performance Not as much as we needed Losing a SSD meant the host crashed Negating the whole RAID 1 setup SSDs aren't as reliable as spinning rust in high update scenarios

Naming conventions None in the beginning We adopted sys.* for system metrics user.* for user testing metrics Anything else that made sense

Metrics collectors Collectd ran into memory problems Used too much RAM Switch to Diamond Python application Base framework + metric collection scripts Added custom patches for internal metrics

Relaying We started with relays only on the cluster Relaying was done based on regex matching Ran into CPU bottlenecks as we added nodes Spun up relay nodes in each datacenter Did not account for organisational growth CPU was still a bottleneck Ran multiple relays on each host Haproxy used as a load balancer Pacemaker used for cluster failover Rewrite in C http://github.com/grobian/carbon-c-relay

statsd We added statsd early on We didn't use it for quite some time Found that our PCI vulnerability scanner reliably crashed it Patched it to handle errors, log and throw away bad input The first major use was for throttling external provider input We use this only for metrics from a couple of applications.

Business metrics Turns out, our developers like Graphite They didn't understand RRD/Whisper semantics though Treat graphite queries as if they were SQL Create a very large number of named metrics Not much data in each metric, but the request was for 5.3TiB of space

Sharding take 2 Manually maintaining regexes became painful Two datacenters 10 backend servers Keeping disk usage balanced was even harder We didn't know who would create metrics and when (this is a feature, not a bug)

Sharding take 2 Introduce hashing Switch from RAID 1 to RAID 0 Store data in two locations in a ring Mirror rings between datacenters Move metrics around so we don't lose data Ugly shell scripts to synchronise data between datacenters. http://github.com/jssjr/carbonatedoes the same things, but is already out there.

Current status (Disk IOPS)

Using Graphite Graphs Time series data (default graphs) Sparklines (via API) Dashboards Developers create their own Overhead displays Additional charting libraries D3.js, Rickshaw Nagios Trend based alerting Passive checks

Current problems Hardware CPU usage Disk IO Disks Too easy to saturate We saturate disks Reading can get a bit slow SSDs die under update load A disk lasts between 12 to 18 months.

More interesting problems Software The frontend melts down at a few thousand hosts in sys.* We have had problems recording data after upgrading whisper Horizontal scalability Adding shards is hard Replacing SSDs is getting a bit expensive People Want a graph, throw the data at Graphite Even if it isn't time series data or one record a day

Things we are looking at Second order rate of change alerting Not just the trend, the rate at which it changes Hbase/Cassandra/RIAK for storage Anomaly detection Skyline, etc Tracking even more business metrics Hiring people to work on such fun problems Developers, Sysadmins... http://www.booking.com/jobs

?