Gradwell
VoIP Migration Issues Report
For Gradwell Customers and Partners
With Compliments
June 2013, V.1.0 Draft
Table of Contents
1. PURPOSE OF DOCUMENT
2. HIGH LEVEL OVERVIEW
3. THE WAY FORWARD

www.gradwell.com 01225 800 800 info@gradwell.com
1. Background & Purpose of document

Gradwell completed a migration to our new VoIP platform on Thursday 6th June 2013. Whilst the migration was successfully completed by 07:00, we experienced further problems from 09:20 until mid-afternoon which were solely related to the process of migration.

The purpose of this report is to provide further detail and clarification of the events which impacted the migration of services to our new VoIP platform on 6th June 2013. It must not be considered to be a comprehensive technical document. This report includes a high-level overview of the problem experienced, details of the root cause and the action taken to reduce the risk of recurrence.
2. High Level Overview

Over the last 2 years Gradwell has been working on a new VoIP platform, with the aim of installing it on our new datacentre hardware in Telehouse West as part of our £1.2 million investment. In the last 3 months we have been implementing a safe launch program to minimise service disruption. This included capacity management and customer experience tests through simulated VoIP calls and alpha tests with Customers and Partners. Our engineering team developed a testing system which created over 3 times our expected call volume for a sustained period, while at the same time testing the redundancy of the platform by removing core servers from it. Having moved approximately 25% of the normal customer load, we concluded that the time was right to complete the migration.

On Thursday 6th June 2013 we began the change at 06:30 by migrating our DNS records for the VoIP system and redirecting traffic on the previous SIP routers to the new system. Due to the potential impact of the change, we had additional Customer Services, Engineering and Infrastructure staff available to support this migration. We had completed all implementation by 07:00 and continued to monitor and test the platform.

At 09:20 our internal monitoring identified an issue with all calls and registrations failing on our platform. After a complex investigation we identified the root cause as an internal fraud protection feature. This incorrectly identified our new VoIP platform as attempting to perform a fraudulent attack and subsequently firewalled it. An emergency change was implemented to correct this at 10:16, enabling service to be fully restored.

At 11:30 we were alerted by customers that they were seeing inbound and outbound calls failing intermittently. A detailed analysis of the platform was performed and identified a collection of enhancements, which were then implemented in a controlled and timely manner.
These comprised:
- Increased the number of processes available to OpenSIPS on our SIP proxies
- Increased the number of MySQL connections available to our SIP proxies
- Changed the IO scheduler to increase performance on the SIP proxies
- Changed the OS storage for our SIP proxies to use local SAS disks to increase performance
- Moved the routing of our VoIP platform servers onto a different switch to reduce a potential networking issue
- Applied a patch to our SIP registration servers to stop customers' phones registering too many times

From 13:20, once a few of the changes had been completed, we saw a significant decrease in the number of problems customers were reporting. By 17:00, all the enhancements had been successfully installed and verified as working.

We have subsequently worked with customers who have been unable to connect to our system, and these cases appear to be exclusively related to customer firewall issues.
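
For readers who want a concrete picture of the 09:20 root cause, the false positive in which the fraud protection feature firewalled the new platform, the following is a minimal illustrative sketch in Python. It is not Gradwell's actual implementation. The class name FraudGuard, the threshold, the window and the IP ranges are all hypothetical, and the addresses are taken from the ranges reserved for documentation. It shows one common safeguard: a rate-based blocker that checks an explicit list of trusted internal networks before applying any limit.

```python
import ipaddress
from collections import defaultdict, deque


class FraudGuard:
    """Hypothetical rate-based blocker that never blocks trusted networks."""

    def __init__(self, max_requests, window_seconds, trusted_networks=()):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Networks (CIDR strings) that must never be firewalled,
        # e.g. the operator's own platform servers.
        self.trusted = [ipaddress.ip_network(n) for n in trusted_networks]
        self.history = defaultdict(deque)  # source IP -> recent request times
        self.blocked = set()

    def is_trusted(self, source_ip):
        addr = ipaddress.ip_address(source_ip)
        return any(addr in net for net in self.trusted)

    def allow(self, source_ip, now):
        """Return True if the request should pass, False if it is blocked."""
        # Trusted sources bypass rate limiting entirely.
        if self.is_trusted(source_ip):
            return True
        if source_ip in self.blocked:
            return False
        times = self.history[source_ip]
        times.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) > self.max_requests:
            self.blocked.add(source_ip)
            return False
        return True
```

With a guard created as FraudGuard(max_requests=3, window_seconds=10, trusted_networks=["192.0.2.0/24"]), a burst of requests from 192.0.2.10 is always allowed, whereas a fourth request within ten seconds from an untrusted address such as 203.0.113.9 causes that address to be blocked. In a migration scenario, the new platform's address ranges would need to be on the trusted list before traffic is redirected.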
3. The way forward

As always, we will be holding a Lessons Learned review. Progress on the actions from the Lessons Learned review is regularly reviewed and managed by our Problem Resolution Manager.

Gradwell sincerely apologises for any inconvenience caused by the issues seen following the migration of services, and wants to reassure our customers that the objective behind the rebuild and migration is to move our technology forward and provide a high-quality service. Having now moved onto the new platform, we are confident that our legacy capacity and scaling issues have been left behind.