Flying Circus RCA report #13271 (2014-03-17): Reduced storage performance causing outages in customer applications




Christian Theune
Root Cause Analysis

FOCAL POINT
Reduced storage performance causing outages in customer applications.

WHEN
From Tuesday, 2014-03-04 16:16 until Thursday, 2014-03-06 17:50, while replacing old storage servers with new servers and re-balancing hardware for improved power and network utilisation.

WHERE
Flying Circus data center Oberhausen, public hosting cluster.

ACTUAL IMPACT
Customer applications experienced very low storage performance, causing applications to partially respond slower or become unavailable due to timeouts.

POTENTIAL IMPACT
Consistent storage availability and performance is critical to the performance of customer applications. Unpredictable slow-downs and unavailability would render the Flying Circus unfit for critical applications.

Glossary

CARTMAN: Codename for storage servers, suffixed with a running number.
CEPH: The distributed object storage software used for our VM block devices (see http://ceph.com/docs/master/start/intro/).
OSD: A physical or logical storage unit (e.g., LUN); Ceph users often conflate the term OSD with Ceph OSD Daemon (from http://ceph.com/docs/master/glossary/#term-osd).
DC: Data center.
BACKFILL: A special case of recovery (see http://ceph.com/docs/master/dev/placement-group/#user-visible-pg-states).
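The cluster and placement-group states referred to throughout this report (peering, backfill, recovery) can be inspected from any node with an admin keyring. A minimal sketch using the standard Ceph CLI; output shapes vary between releases, and the host names match our CARTMAN naming scheme:

```shell
# Overall cluster status: health, monitor quorum, OSD count, PG state summary.
ceph -s

# Health detail lists degraded PGs and warnings such as slow requests.
ceph health detail

# PGs stuck in an inactive state (e.g. the 'peering' hang described below).
ceph pg dump_stuck inactive

# Which OSDs live on which host (cartman06..cartman09).
ceph osd tree
```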

Timeline

Monday 2014-03-03
- Handed provisioned storage servers to UPS for data center shipping.

Tuesday 2014-03-04
- before 12:00  UPS delivered the new servers to the data center.
- before 15:00  Servers were installed into racks in powered-off state by DC personnel according to our instructions.
- 15:00-15:30   Servers were powered on and checked for correct inclusion into our DC environment without any activated higher-level functions.
- 15:52         Start enabling Ceph on cartman06.
- 16:16-16:39   Full storage outage due to Ceph placement groups stuck in peering status.
- 16:39         Disable flow control for cartman06 on the attached switch.
- 17:00         Disable flow control on all switch ports.
- 17:00         A single drive fault on cartman06 occurred. Ceph removed the affected disk from the cluster; cartman06 continued correct operation.
- 17:23         Start enabling Ceph on cartman07.
- 18:01-18:03   Customer services show slow responses / timeouts.
- 23:23         Start enabling Ceph on cartman08.
- 23:28         Start enabling Ceph on cartman09.
- 23:45         RAID controller on cartman09 started showing failures. Removed cartman09 from the cluster again.

Wednesday 2014-03-05
- 13:23-13:26   Customer services show slow responses / timeouts.
- 15:30         Reduce Ceph recovery traffic, report on stuck requests more aggressively, avoid superfluous restarts of OSDs.
- 16:07-16:08   Customer services show slow responses / timeouts.
- 17:20-17:21   Customer services show slow responses / timeouts.

Thursday 2014-03-06
On this day, two of our engineers (STW, CT) were at our DC premises, performing the tasks described below.
- 08:28-08:33   Customer services show slow responses / timeouts.
- 08:50-09:03   Customer services show slow responses / timeouts.
- 09:03         Remove old, evacuated storage servers from racks.
- 09:20         Replace the failed drive in cartman06 with a new drive.
- 09:20-09:43   Customer services show slow responses / timeouts.
- 09:30         Move cartman09 from rack OB-4-D5 to OB-4-A5. Replace the failed RAID controller on cartman09.
- 10:00-10:08   Customer services show slow responses / timeouts.
- 10:22-10:45   Customer services show slow responses / timeouts.
- 11:30         Reduce Ceph OSD worker threads back to default.
- 12:00         Move cartman06 from rack OB-4-D5 to OB-4-A5.
- 12:07-12:20   Customer services show slow responses / timeouts.
- 13:21-14:18   Customer services show slow responses / timeouts.
- 15:41         Change the IO scheduler to CFQ for physical drives.
- 16:06-17:50   Customer services show slow responses / timeouts.
- 17:02         Further, extreme reduction of Ceph backfill and recovery rates. Storage performance became reliable from this point on.

No further outages related to this incident occurred in the following days.
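The scheduler change at 15:41 can be made at runtime via sysfs. A sketch with the device name as a placeholder; it requires root and does not persist across reboots unless also set in the boot configuration:

```shell
# Show the available IO schedulers; the active one is shown in brackets.
cat /sys/block/sda/queue/scheduler

# Switch the physical drive to CFQ at runtime.
echo cfq > /sys/block/sda/queue/scheduler
```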

Causes

NETWORKING: FLOW CONTROL

The first outage occurred on Tuesday while making the first new server a member of the existing Ceph cluster. The outage showed that the server was picked up by the cluster, but placement groups never left the peering status. After checking individual parts of the server's automatically generated configuration, we saw packet loss for jumbo frames. We found that the switch showed error messages for the storage backend network used for Ceph's replication traffic. The switch settings showed that the flow control option was enabled, which is incompatible with the jumbo frames feature. We disabled flow control, after which traffic resumed on the port; the new storage server became a member of the cluster and successfully took over data and traffic.

Flow control has not been an option that we manage within our policies, so the switches exhibited arbitrary settings. Even though the setting theoretically conflicts with the jumbo frame option, our switches had always correctly preferred jumbo frames over flow control. We suspect that the conflicting setting became harmful because we moved the port from a non-jumbo-frames VLAN to a jumbo-frames-enabled VLAN.

CEPH: RECOVERY TRAFFIC IMPACT

Subsequent outages were not caused by networking issues but by Ceph's internal reorganisation, which generated enough traffic to block ongoing read and write operations from the applications. Recovery traffic was necessary to move data from the old servers to the new ones. We did this step by step, evacuating one server at a time, and found several angles from which we could improve performance over time. However, tuning required many small and carefully executed steps to avoid making things worse. At times the slow responses caused applications or VMs to behave extremely sluggishly, including failed requests due to timeouts.
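Both symptoms can be observed from a storage node: the pause-frame (flow control) state of the replication NIC, and client IO blocked behind recovery traffic. A hedged sketch; the interface name is a placeholder, and the switch-side change is vendor-specific:

```shell
# Check whether Ethernet flow control (pause frames) is active on the
# storage backend interface.
ethtool -a eth1

# Disable pause frames on the NIC side; the switch port needs the
# matching change.
ethtool -A eth1 autoneg off rx off tx off

# Watch the cluster log for 'slow requests' warnings while recovery runs.
ceph -w
```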
Ceph allows tight control over how many worker threads perform which tasks and at what rates. We reduced the number of workers and the traffic rates for recovery over time until reorganisation was no longer an issue; we reached the final settings on Thursday evening. Tuning the recovery rates to this low level is a trade-off: it improves the performance of the applications under recovery scenarios but lengthens the window in which double faults may occur.

SERVER HARDWARE: PERFORMANCE CHARACTERISTICS

The new storage servers were procured based on our experience with hardware over the last years, with iSCSI, and with Ceph operations. We improved on the existing baseline by using larger, more affordable disks for storage with SSDs for journalling, as well as a lot more RAM. In addition we referenced and cross-checked the Ceph hardware recommendations.
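Ceph exposes these knobs as OSD options that can be changed at runtime and persisted in ceph.conf. The report does not state the exact values we settled on, so the numbers below are illustrative; the option names are the standard ones in 2014-era Ceph releases:

```shell
# Throttle recovery/backfill on all running OSDs at runtime.
ceph tell 'osd.*' injectargs \
    '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# Persist the same settings in /etc/ceph/ceph.conf so restarted OSDs keep them:
#   [osd]
#   osd max backfills = 1
#   osd recovery max active = 1
#   osd recovery op priority = 1
```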

The boilerplate hardware was the same as before, using standard Thomas Krenn (TK) chassis, boards, CPUs, controllers, etc. that are already in use in the Flying Circus. We provisioned the servers as usual and moved them to the data center right away, on the assumption that their characteristics would be at least equal to, or even better than, the current setup.

It turned out that our configuration did not perform as expected. We experienced a head-of-line blocking problem in the controller's write-back cache: the strongly differing performance characteristics of SSDs and magnetic disks caused the SSDs to wait for requests that were being served by the magnetic disks. As the magnetic disks were (intentionally) slower than those in the previous servers, overall performance dropped dramatically. When we noticed the bad performance we got in touch with our vendor, who helped us diagnose the situation quickly. We reconfigured the cache and performed firmware updates for the affected controllers based on the vendor's recommendations.

SERVER HARDWARE: FAILED COMPONENTS

We intended to replace our existing cluster of 7 old machines with a new cluster of 4 better machines. While migrating data to the new servers we faced failed components, so we could actually only deploy to 3 servers, as one RAID controller and one disk died. This put more strain on the already busy machines while we were dealing with the performance impact mentioned above. Ceph itself handled those errors gracefully and we did not lose any data because of them, but they both increased the overall recovery traffic and put more load on fewer servers.

FOLLOWING THE PLAN VS. IMPROVISATION

We had a ready-made plan for the roll-out of the servers that we kept close and followed where possible. We also know that we have to improvise upon unexpected conditions. In this specific case we have mixed feelings about the balance we struck this time.
Visiting the data center according to schedule was helpful, as we were able to fix the broken components quickly and with short communication cycles. We were also able to update our inventory after having exchanged a few servers with only remote-hands support. This helped us understand our situation and keep track of our options while moving servers around.

It was also helpful to remove additional maintenance tasks from our schedule once it became clear that the situation was having a bad impact on our customers' ability to use our infrastructure properly. We originally intended to deploy a new set of routers, which we postponed: we installed them physically but have left them switched off for now.

Additionally, we moved the storage servers again to achieve better utilisation of power, network resources, and rack capacity. Due to operational errors this caused additional recovery traffic while we were still dealing with the associated performance penalty. This is where overall impact could have been reduced and some downtime avoided: either by postponing the non-critical task or by handling the move correctly operationally.
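For reference, a controller-cache reconfiguration of the kind described under SERVER HARDWARE: PERFORMANCE CHARACTERISTICS could look roughly like the following. The report does not name the controller model or the exact settings, so these LSI MegaRAID CLI commands and logical-drive numbers are purely illustrative:

```shell
# Show the current cache policy per logical drive.
MegaCli64 -LDGetProp -Cache -LAll -aAll

# Hypothetical: put the SSD journal logical drive (here L0) into write-through
# so fast journal writes are not queued behind cached writes to slow disks.
MegaCli64 -LDSetProp WT -L0 -a0

# Hypothetical: keep write-back on the magnetic-disk logical drives.
MegaCli64 -LDSetProp WB -L1 -a0
```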

Short-term fixes

During the impact and in the days immediately following the incident we performed the following actions to improve the situation:

- Applied RAID controller firmware updates as recommended by Thomas Krenn. (DONE)
- Tuned the recovery options for Ceph to avoid making it unusable during recovery or backfill. (DONE)
- Updated all switch configurations to not use flow control for now. (DONE)
- Improved the Ceph OSD startup configuration by starting OSDs more slowly, avoiding crushed journal disks and thus achieving faster startup and reduced boot times. (DONE)
- Started improving our customer status communications using a new status page, available at http://status.flyingcircus.io. On this page we now report current status, representative performance statistics, outages, and ongoing maintenance. Customers can subscribe via e-mail, SMS, and RSS. (DONE)
- Improved the RAID controller configuration to avoid head-of-line blocking and foster SSD journal performance. (DONE)
- Increased block device caching for Ceph in QEMU-KVM processes. (DONE)

Long-term improvements

The following improvements will be performed in the future to avoid similar situations:

- Overall Ceph performance tuning. (ongoing, #12589)
- Reduce the strip size for pseudo RAID 0 disks.
- Investigate improved disk and controller performance through JBOD (instead of pseudo RAID 0).
- Improve network latency (remove jumbo frames, re-enable flow control).
- Reduce the Ceph journal size.
- Review recovery rate settings and perform a risk analysis for double faults. (#13295)
- Review different SSD usage with a newer version of Ceph, leveraging the Linux bcache component for read and write caching. (#13296)
- Improve operations that require temporarily disabling Ceph servers to avoid unnecessary backfill traffic. (#13297)
- Perform burn-in tests of hardware before deployment to the data center. (added to our provisioning checklists)

Additionally, the following even longer-term lessons will be applied:

- We will perform better hardware testing before procurement.
Our vendor has already acknowledged his interest in us performing those tests and in supporting our procurement process in this way.
- We will avoid big-bang hardware exchanges: even when it seems safe, we will adhere to a slower ramp-up following the "one, few, many" paradigm that we also follow in other deployment scenarios.
- We will move to 10G Ethernet to reduce latency and improve network bandwidth.
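A burn-in test of the kind added to our provisioning checklists could be sketched with fio. The device path, runtime, and access pattern below are assumptions for illustration, not our actual checklist, and the run is destructive (only point it at drives without data):

```shell
# One-hour random-write burn-in of a raw drive; watch for latency outliers
# and I/O errors in the fio summary and in the kernel log afterwards.
fio --name=burnin --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --time_based --runtime=3600 --group_reporting
```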