Flying Circus RCA report #13271 (2014-03-17): Reduced storage performance causing outages in customer applications




Christian Theune
Root Cause Analysis

FOCAL POINT
Reduced storage performance causing outages in customer applications.

WHEN
From Tuesday, 2014-03-04 16:16 until Thursday, 2014-03-06 17:50, while replacing old storage servers with new servers and re-balancing hardware for improved power and network utilisation.

WHERE
Flying Circus data center Oberhausen, public hosting cluster.

ACTUAL IMPACT
Customer applications experienced very low storage performance, causing applications to partially respond slower or become unavailable due to timeouts.

POTENTIAL IMPACT
Consistent storage availability and performance is critical to the performance of customer applications. Unpredictable slow-downs and unavailability would render the Flying Circus unfit for critical applications.

Glossary

CARTMAN: Codename for storage servers, suffixed with a running number.
CEPH: The distributed object storage software used for our VM block devices (see http://ceph.com/docs/master/start/intro/).
OSD: A physical or logical storage unit (e.g., LUN); Ceph users often conflate the term OSD with Ceph OSD Daemon (from http://ceph.com/docs/master/glossary/#term-osd).
DC: Data center.
BACKFILL: A special case of recovery (see http://ceph.com/docs/master/dev/placement-group/#user-visible-pg-states).
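The cluster and placement-group states referred to throughout this report (peering, backfill, recovery) can be inspected from any node with an admin keyring. A minimal sketch using the standard Ceph CLI; output shapes vary between releases, and the host names match our CARTMAN naming scheme:

```shell
# Overall cluster status: health, monitor quorum, OSD count, PG state summary.
ceph -s

# Health detail lists degraded PGs and warnings such as slow requests.
ceph health detail

# PGs stuck in an inactive state (e.g. the 'peering' hang described below).
ceph pg dump_stuck inactive

# Which OSDs live on which host (cartman06..cartman09).
ceph osd tree
```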

Timeline

Monday 2014-03-03
- Handed provisioned storage servers to UPS for data center shipping.

Tuesday 2014-03-04
- before 12:00  UPS delivered the new servers to the data center.
- before 15:00  Servers were installed into racks in powered-off state by DC personnel according to our instructions.
- 15:00-15:30   Servers were powered on and checked for correct inclusion into our DC environment without any activated higher-level functions.
- 15:52         Start enabling Ceph on cartman06.
- 16:16-16:39   Full storage outage due to Ceph placement groups stuck in peering status.
- 16:39         Disable flow control for cartman06 on the attached switch.
- 17:00         Disable flow control on all switch ports.
- 17:00         A single drive fault on cartman06 occurred. Ceph removed the affected disk from the cluster; cartman06 continued correct operation.
- 17:23         Start enabling Ceph on cartman07.
- 18:01-18:03   Customer services show slow responses / timeouts.
- 23:23         Start enabling Ceph on cartman08.
- 23:28         Start enabling Ceph on cartman09.
- 23:45         RAID controller on cartman09 started showing failures. Removed cartman09 from the cluster again.

Wednesday 2014-03-05
- 13:23-13:26   Customer services show slow responses / timeouts.
- 15:30         Reduce Ceph recovery traffic, report on stuck requests more aggressively, avoid superfluous restarts of OSDs.
- 16:07-16:08   Customer services show slow responses / timeouts.
- 17:20-17:21   Customer services show slow responses / timeouts.

Thursday 2014-03-06
On this day, two of our engineers (STW, CT) were at our DC premises, performing the tasks described below.
- 08:28-08:33   Customer services show slow responses / timeouts.
- 08:50-09:03   Customer services show slow responses / timeouts.
- 09:03         Remove old, evacuated storage servers from racks.
- 09:20         Replace the failed drive in cartman06 with a new drive.
- 09:20-09:43   Customer services show slow responses / timeouts.
- 09:30         Move cartman09 from rack OB-4-D5 to OB-4-A5. Replace the failed RAID controller on cartman09.
- 10:00-10:08   Customer services show slow responses / timeouts.
- 10:22-10:45   Customer services show slow responses / timeouts.
- 11:30         Reduce Ceph OSD worker threads back to default.
- 12:00         Move cartman06 from rack OB-4-D5 to OB-4-A5.
- 12:07-12:20   Customer services show slow responses / timeouts.
- 13:21-14:18   Customer services show slow responses / timeouts.
- 15:41         Change the IO scheduler to CFQ for physical drives.
- 16:06-17:50   Customer services show slow responses / timeouts.
- 17:02         Further, extreme reduction of Ceph backfill and recovery rates. Storage performance became reliable from this point on.

No further outages related to this incident occurred in the following days.
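The scheduler change at 15:41 can be made at runtime via sysfs. A sketch with the device name as a placeholder; it requires root and does not persist across reboots unless also set in the boot configuration:

```shell
# Show the available IO schedulers; the active one is shown in brackets.
cat /sys/block/sda/queue/scheduler

# Switch the physical drive to CFQ at runtime.
echo cfq > /sys/block/sda/queue/scheduler
```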

Causes

NETWORKING: FLOW CONTROL

The first outage occurred on Tuesday while making the first new server a member of the existing Ceph cluster. The outage showed that the server was picked up by the cluster, but placement groups never left the peering status. After checking individual parts of the server's automatically generated configuration, we saw packet loss for jumbo frames. We found that the switch showed error messages for the storage backend network used for Ceph's replication traffic. The switch settings showed that the flow control option was enabled, which is incompatible with the jumbo frames feature. We disabled flow control, after which traffic resumed on the port; the new storage server became a member of the cluster and successfully took over data and traffic.

Flow control has not been an option that we manage within our policies, so the switches exhibited arbitrary settings. Even though the setting theoretically conflicts with the jumbo frame option, our switches had always correctly preferred jumbo frames over flow control. We suspect that the conflicting setting became harmful because we moved the port from a non-jumbo-frames VLAN to a jumbo-frames-enabled VLAN.

CEPH: RECOVERY TRAFFIC IMPACT

Subsequent outages were not caused by networking issues but by Ceph's internal reorganisation, which generated enough traffic to block ongoing read and write operations from the applications. Recovery traffic was necessary to move data from the old servers to the new ones. We did this step by step, evacuating one server at a time, and found several angles from which we could improve performance over time. However, tuning required many small and carefully executed steps to avoid making things worse. At times the slow responses caused applications or VMs to behave extremely sluggishly, including failed requests due to timeouts.
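Both symptoms can be observed from a storage node: the pause-frame (flow control) state of the replication NIC, and client IO blocked behind recovery traffic. A hedged sketch; the interface name is a placeholder, and the switch-side change is vendor-specific:

```shell
# Check whether Ethernet flow control (pause frames) is active on the
# storage backend interface.
ethtool -a eth1

# Disable pause frames on the NIC side; the switch port needs the
# matching change.
ethtool -A eth1 autoneg off rx off tx off

# Watch the cluster log for 'slow requests' warnings while recovery runs.
ceph -w
```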
Ceph allows tight control over how many worker threads perform which tasks and at what rates. We reduced the number of workers and the traffic rates for recovery over time until reorganisation was no longer an issue; we reached the final settings on Thursday evening. Tuning the recovery rates to this low level is a trade-off: it improves the performance of the applications under recovery scenarios but lengthens the window in which double faults may occur.

SERVER HARDWARE: PERFORMANCE CHARACTERISTICS

The new storage servers were procured based on our experience with hardware over the last years, with iSCSI, and with Ceph operations. We improved on the existing baseline by using larger, more affordable disks for storage with SSDs for journalling, as well as a lot more RAM. In addition we referenced and cross-checked the Ceph hardware recommendations.
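Ceph exposes these knobs as OSD options that can be changed at runtime and persisted in ceph.conf. The report does not state the exact values we settled on, so the numbers below are illustrative; the option names are the standard ones in 2014-era Ceph releases:

```shell
# Throttle recovery/backfill on all running OSDs at runtime.
ceph tell 'osd.*' injectargs \
    '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# Persist the same settings in /etc/ceph/ceph.conf so restarted OSDs keep them:
#   [osd]
#   osd max backfills = 1
#   osd recovery max active = 1
#   osd recovery op priority = 1
```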

The boilerplate hardware was the same as before, using standard Thomas Krenn (TK) chassis, boards, CPUs, controllers, etc. that are already in use in the Flying Circus. We provisioned the servers as usual and moved them to the data center right away, on the assumption that their characteristics would be at least equal to, or even better than, the current setup.

It turned out that our configuration did not perform as expected. We experienced a head-of-line blocking problem in the controller's write-back cache: the strongly differing performance characteristics of SSDs and magnetic disks caused the SSDs to wait for requests that were being served by the magnetic disks. As the magnetic disks were (intentionally) slower than those in the previous servers, overall performance dropped dramatically. When we noticed the bad performance we got in touch with our vendor, who helped us diagnose the situation quickly. We reconfigured the cache and performed firmware updates for the affected controllers based on the vendor's recommendations.

SERVER HARDWARE: FAILED COMPONENTS

We intended to replace our existing cluster of 7 old machines with a new cluster of 4 better machines. While migrating data to the new servers we faced failed components, so we could actually only deploy to 3 servers, as one RAID controller and one disk died. This put more strain on the already busy machines while we were dealing with the performance impact mentioned above. Ceph itself handled those errors gracefully and we did not lose any data because of them, but they both increased the overall recovery traffic and put more load on fewer servers.

FOLLOWING THE PLAN VS. IMPROVISATION

We had a ready-made plan for the roll-out of the servers that we kept close and followed where possible. We also know that we have to improvise upon unexpected conditions. In this specific case we have mixed feelings about the balance we struck this time.
Visiting the data center according to schedule was helpful, as we were able to fix the broken components quickly and with short communication cycles. We were also able to update our inventory after having exchanged a few servers with only remote-hands support. This helped us understand our situation and keep track of our options while moving servers around.

It was also helpful to remove additional maintenance tasks from our schedule once it became clear that the situation was having a bad impact on our customers' ability to use our infrastructure properly. We originally intended to deploy a new set of routers, which we postponed: we installed them physically but have left them switched off for now.

Additionally, we moved the storage servers again to achieve better utilisation of power, network resources, and rack capacity. Due to operational errors this caused additional recovery traffic while we were still dealing with the associated performance penalty. This is where overall impact could have been reduced and some downtime avoided: either by postponing the non-critical task or by handling the move correctly operationally.
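For reference, a controller-cache reconfiguration of the kind described under SERVER HARDWARE: PERFORMANCE CHARACTERISTICS could look roughly like the following. The report does not name the controller model or the exact settings, so these LSI MegaRAID CLI commands and logical-drive numbers are purely illustrative:

```shell
# Show the current cache policy per logical drive.
MegaCli64 -LDGetProp -Cache -LAll -aAll

# Hypothetical: put the SSD journal logical drive (here L0) into write-through
# so fast journal writes are not queued behind cached writes to slow disks.
MegaCli64 -LDSetProp WT -L0 -a0

# Hypothetical: keep write-back on the magnetic-disk logical drives.
MegaCli64 -LDSetProp WB -L1 -a0
```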

Short-term fixes

During the impact and in the days immediately following the incident we performed the following actions to improve the situation:

- Applied RAID controller firmware updates as recommended by Thomas Krenn. (DONE)
- Tuned the recovery options for Ceph to avoid making it unusable during recovery or backfill. (DONE)
- Updated all switch configurations to not use flow control for now. (DONE)
- Improved the Ceph OSD startup configuration by starting OSDs more slowly, avoiding crushed journal disks and thus achieving faster startup and reduced boot times. (DONE)
- Started improving our customer status communications using a new status page, available at http://status.flyingcircus.io. On this page we now report current status, representative performance statistics, outages, and ongoing maintenance. Customers can subscribe via e-mail, SMS, and RSS. (DONE)
- Improved the RAID controller configuration to avoid head-of-line blocking and foster SSD journal performance. (DONE)
- Increased block device caching for Ceph in QEMU-KVM processes. (DONE)

Long-term improvements

The following improvements will be performed in the future to avoid similar situations:

- Overall Ceph performance tuning. (ongoing, #12589)
- Reduce the strip size for pseudo RAID 0 disks.
- Investigate improved disk and controller performance through JBOD (instead of pseudo RAID 0).
- Improve network latency (remove jumbo frames, re-enable flow control).
- Reduce the Ceph journal size.
- Review recovery rate settings and perform a risk analysis for double faults. (#13295)
- Review different SSD usage with a newer version of Ceph, leveraging the Linux bcache component for read and write caching. (#13296)
- Improve operations that require temporarily disabling Ceph servers to avoid unnecessary backfill traffic. (#13297)
- Perform burn-in tests of hardware before deployment to the data center. (added to our provisioning checklists)

Additionally, the following even longer-term lessons will be applied:

- We will perform better hardware testing before procurement.
Our vendor has already acknowledged his interest in us performing those tests and in supporting our procurement process in this way.
- We will avoid big-bang hardware exchanges: even when it seems safe, we will adhere to a slower ramp-up following the "one, few, many" paradigm that we also follow in other deployment scenarios.
- We will move to 10G Ethernet to reduce latency and improve network bandwidth.
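A burn-in test of the kind added to our provisioning checklists could be sketched with fio. The device path, runtime, and access pattern below are assumptions for illustration, not our actual checklist, and the run is destructive (only point it at drives without data):

```shell
# One-hour random-write burn-in of a raw drive; watch for latency outliers
# and I/O errors in the fio summary and in the kernel log afterwards.
fio --name=burnin --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --time_based --runtime=3600 --group_reporting
```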