Network Virtualization Platform (NVP) - ORD Service Interruption During Scheduled Maintenance
June 20th, 2013
Time of Incident: 03:45 CDT

While performing a scheduled upgrade on the Software Defined Networking (SDN) control cluster for Next Gen Cloud Servers in our ORD datacenter, we experienced two issues that created downtime for our customers and forced us to unexpectedly extend the maintenance window.

The first issue occurred when a configuration sync flag did not fully apply to all hypervisors via the upgrade manager software deploying the cluster updates. This caused issues for customers ranging from intermittent packet loss to a few minutes of network disruption. The root cause of this problem was the manual configuration of the automated deployment tool, not the underlying cloud network. Rackspace and vendor engineers immediately identified and fixed the issue by 3:45 AM CDT, within the original maintenance window.

During the maintenance wrap-up process, Rackspace engineers discovered a component of the network configuration that had been inadvertently overwritten by the upgrade. That component had been deployed fairly recently, on May 24th, 2013, and was necessary to ensure that customer server connectivity was maintained and new server provisioning succeeded. Rackspace chose to extend the maintenance window by one hour in order to fix the configuration and reboot the clusters. The clusters finished syncing by 5:30 AM, and the hypervisors were then able to check back in for updated flows. Any residual customer impact was confirmed resolved between 5:45 AM and 6:00 AM. Had Rackspace closed the maintenance window, our customers would have been exposed to potential intermittent network instability and provisioning errors until the next maintenance window could be scheduled.

Rackspace prides itself on the transparency of our communications. In this event, we did not live up to our standards. We believe the decision to extend the window was the right one for our customers, but we did not clearly communicate the rationale for that decision in the manner our customers expect. Stability and uptime are paramount to our customers and to Rackspace. We apologize for the issues and the manner in which communications were handled. We are reviewing all elements of our maintenance and incident management processes to ensure that these issues do not occur again. If you have any questions, please contact a member of your support team.
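For illustration only, the short Python sketch below shows one way to verify, after a maintenance of this kind, that each hypervisor's Open vSwitch reports a live connection to its SDN controllers. This is a minimal sketch and not Rackspace's or the vendor's actual tooling; the hypervisor list, the use of SSH, and the reliance on "is_connected: true" appearing in ovs-vsctl show output are assumptions.

#!/usr/bin/env python3
# Hypothetical post-maintenance check: confirm each hypervisor's Open vSwitch
# reports an active controller connection, so hosts that failed to pick up the
# new configuration can be spotted quickly. Host names are placeholders.
import subprocess

HYPERVISORS = ["hv01.example.com", "hv02.example.com"]  # illustrative inventory

def controller_connected(host: str) -> bool:
    # "ovs-vsctl show" prints one "is_connected: true" line per live
    # controller connection; running it over SSH keeps the check agentless.
    out = subprocess.run(
        ["ssh", host, "ovs-vsctl", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "is_connected: true" in out

if __name__ == "__main__":
    for hv in HYPERVISORS:
        status = "connected" if controller_connected(hv) else "NOT CONNECTED"
        print(f"{hv}: {status}")

In such a sketch, a host reporting NOT CONNECTED would be a candidate for re-running the configuration sync before the maintenance window is closed.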
ORD Cloud Server Instability
June 12, 2013
Time of Incident: 10:30 AM CDT

At approximately 10:30 a.m. CDT, our cloud engineers were alerted to an issue impacting services for several thousand customers within our ORD1 data center. The issue was caused when our Software Defined Network (SDN) cluster suffered cascading node failures, causing some customers to experience intermittent network connectivity, and in some cases extended service interruption, until approximately 4:30 p.m. CDT.

The controller node failures were caused by corrupted port data from Open vSwitch. The corrupted port data triggered a previously unidentified bug that caused nodes within the control cluster to crash repeatedly until the corrupted port data was identified and fixed. The cluster was repaired and customers began to come back online, with all residual effects eliminated by 4:30 p.m. CDT. The system is now stable and we are working with our SDN vendor on a permanent fix.

Why did we experience issues within the Application Programming Interfaces (APIs) for both DFW and ORD?

While we were experiencing service degradation in the ORD region for Next Gen Cloud Servers, Rackspace also saw availability dips in both our ORD and DFW Next Gen APIs. During this time, we experienced increased traffic in our Control Panel as customers began logging in to check their instances in ORD after the network degradation began. This placed additional load on the systems responsible for image management in both regions. Under this increased traffic, the databases behind those systems became overloaded, which translated into dips in API availability. Recent performance monitoring for those systems had identified queries that could be optimized, and the optimizations were already scheduled for an upcoming code release. To fully resolve the issues in both regions, the query portions of the scheduled code release were hot patched into the environments, which restored API stability for both regions.

We apologize for any inconvenience this may have caused you or your customers. If you have any further questions please feel free to contact a member of your support team.
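As a purely illustrative aside (this is not the vendor's diagnostic, and the details are assumptions), the sketch below shows how one might sweep Open vSwitch interface records on a hypervisor and flag ports whose OpenFlow port number is missing or negative, a common symptom of a port record the datapath never realized cleanly.

#!/usr/bin/env python3
# Illustrative sweep of Open vSwitch interface records: flag entries whose
# "ofport" value is unset or -1. The host name and SSH transport are
# assumptions for the example; this is not the actual corruption check.
import json
import subprocess

def suspect_interfaces(host: str) -> list:
    # --format=json and --columns are standard ovs-vsctl output options.
    out = subprocess.run(
        ["ssh", host, "ovs-vsctl", "--format=json",
         "--columns=name,ofport", "list", "Interface"],
        capture_output=True, text=True, check=True,
    ).stdout
    table = json.loads(out)
    suspects = []
    for name, ofport in table["data"]:
        # A healthy port has a non-negative integer ofport; an empty set
        # (rendered as ["set", []]) or -1 suggests the port never came up.
        if not isinstance(ofport, int) or ofport < 0:
            suspects.append(name)
    return suspects

if __name__ == "__main__":
    for iface in suspect_interfaces("hv01.example.com"):  # placeholder host
        print(f"suspect interface record: {iface}")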
Cloud Load Balancers - ORD1
May 20th, 2013
Time of Incident: 08:27 CDT

On May 20th, 2013, at approximately 08:27 CDT, our Cloud Load Balancer engineers were alerted to an issue impacting Load Balancer Nodes ztn-n09 and ztn-n10 in our ORD1 data center. The cluster containing these nodes experienced a rare capacity issue arising from a combination of active load balancers, new provisioning requests, and overall traffic. This caused both of the affected nodes (ztn-n09 and ztn-n10) to attempt to shift their traffic to the failover node (ztn-n12) simultaneously, which in turn affected network connectivity for the instances supported by ztn-n09 and ztn-n10.

After several attempts to restart services for ztn-n09 and ztn-n10, engineers determined that a reboot of all four nodes in the cluster was required to restore services. After rebooting one of the nodes, engineers discovered that the previous node failures had corrupted the global configuration files, which prevented ztn-n09 and ztn-n10 from being added back into the cluster. The corruption was corrected, and the original two problem nodes were restarted and began to take traffic back from ztn-n12. Services for all Load Balancers were restored at approximately 10:20 CDT. During this time customers would have experienced degraded performance, network latency, or a loss of connectivity to their instances.

We apologize for any inconvenience this may have caused you or your customers. If you have any further questions, please feel free to contact a member of your support team.

Cloud Load Balancers - ORD1
June 6th, 2013
Time of Incident: 22:00 CDT

On the evening of June 6th, 2013, engineers performed updates to the underlying software for our Cloud Load Balancer product in our ORD1 datacenter. The updates were intended to resolve recent bugs and performance issues in the application. The normal process for this work is to fail load balancers over between nodes in the cluster so the load balancers remain up the entire time. During the maintenance, the performance issues manifested as a longer-than-normal failover time of 2 to 3 minutes. This caused some load balancers to experience packet loss, degraded performance, or a brief interruption in connectivity.
Additionally, around 22:30 CDT, a node in one of the clusters experienced a kernel panic as soon as load balancers were failed over to it. This caused an extended period of packet loss and/or connectivity issues, lasting up to 20 minutes, for the affected Cloud Load Balancer instances. This included Cloud Monitoring alerts and tickets being generated for some of the affected load balancers.

Root Cause: The 2-3 minute failovers on some cluster nodes were the result of performance issues in the current version of the software, exacerbated by the effective load on the system at the time. The extended issues for a subset of customers were caused by a kernel panic on the failover node in one of the clusters just after a group of load balancers had been moved there from their assigned node so it could be updated.

Remediation: The performance of failover actions across the board was improved by the software updates. Initial recovery on the failed node consisted of restoring it and completing the move of the load balancers, which took about 20 minutes. The software updates also contained fixes for some of the underlying causes of these kernel panics.

Mitigation: We are working to fix the issues we have identified that can lead to these types of failures. We expect some of the fixes to be made available to us in the coming weeks and will thoroughly exercise them to ensure that all issues are resolved. Furthermore, we are bringing additional hardware online in order to keep pace with the growth of the environment. Going forward, we will schedule this type of maintenance further in advance to give customers more notice, even when it is expected to be non-impacting.

We apologize for any inconvenience this may have caused. If you have any further questions please feel free to contact a member of your support team.

Cloud Load Balancers - ORD
June 14, 2013
Time of Incident: 2:07 PM CDT

On 14 June, at 2:07 PM CDT, our engineers were alerted to an issue impacting one of the Cloud Load Balancer clusters in our ORD data center. Initially, the behavior matched that of traffic-based issues such as a DDoS attack or UDP flood, but further investigation pointed to problems with the underlying software.
The engineers focused on the two nodes that were generating the most alerts and made the decision to reboot them. Normally, a reboot of one node fails its load balancers over to a redundant node, but in this case the underlying software issues prevented this from happening seamlessly for all instances. Following the recovery of these two nodes, engineers began rolling through the remaining nodes and rebooting them, one at a time, to allow failover and reduce impact. After a particular node was rebooted at 3:28 PM, the entire cluster recovered and no further reboots were performed.

Root Cause Investigation: Preliminary data indicates the issue was caused by a potential software bug. The Stingray software normally syncs log and configuration information between nodes. In this case, unneeded log files were removed from one node in the cluster in response to disk space usage alerts. This is a common practice for managing disk space on any system and has been performed many times on our CLB infrastructure in the past. The software had recently been upgraded on this cluster, and it appears that it reacted adversely to the logs being removed. While working the issue, engineers saw a series of rsync processes steadily building on several nodes. Following the recovery, they were able to tie this behavior back to the node where the unneeded log files had been removed. Details have been gathered and escalated to our vendor to confirm the bug and develop a patch or other remediation procedure.

We apologize for any inconvenience this issue has caused you or your customers. If you have any additional questions or concerns, please contact a member of your support team.
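As a closing illustration (a hypothetical sketch, not part of our production monitoring), the snippet below shows the kind of lightweight watchdog that could surface a build-up of rsync processes across cluster nodes before it becomes service-affecting; the node names, SSH transport, and threshold are placeholders.

#!/usr/bin/env python3
# Hypothetical watchdog: count rsync processes on each load balancer node and
# warn when the count climbs, mirroring the symptom engineers observed during
# this incident. Node names and the threshold are illustrative values only.
import subprocess

NODES = ["clb-node-01", "clb-node-02", "clb-node-03", "clb-node-04"]  # placeholders
THRESHOLD = 10  # assumed ceiling; a healthy node runs only a handful of syncs

def rsync_count(node: str) -> int:
    # "pgrep -c rsync" prints the number of matching processes (0 if none).
    result = subprocess.run(
        ["ssh", node, "pgrep", "-c", "rsync"],
        capture_output=True, text=True,
    )
    return int(result.stdout.strip() or 0)

if __name__ == "__main__":
    for node in NODES:
        count = rsync_count(node)
        flag = "WARN" if count > THRESHOLD else "ok"
        print(f"{node}: {count} rsync processes [{flag}]")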