C. Da Rold, S. Mingay
Research Note, 7 November 2003
Commentary

Italian Blackout Impacts IBM Image and Clients' Business

IBM's data center in Vimercate failed to deliver IT services to several clients after the nationwide power cut in September. This is a wake-up call for all businesses to check whether their infrastructure really is resilient.

The Italian blackout of 28 September 2003 (see "Italy's Blackout Shows Enterprises Need Greater Resilience") happened at the "best" possible time: 3 a.m. on a Sunday morning. It caused no major problems or public-order issues. Many people worked that Sunday on contingency containment or emergency services and then forgot the event, saying, "Luckily, it wasn't a working day."

But signs of problems started to appear in the middle of the following week. IBM's main data center in Italy, at Vimercate near Milan, was not delivering services to some clients with outsourcing contracts. A week later, Gartner asked IBM for clarification of the event and discussed the situation with some of the affected organizations. As a matter of due diligence, we also polled other organizations (clients running their own data centers, telecommunications companies and the clients of other outsourcing providers) to check whether further blackout effects had been experienced. None of them reported significant issues.

On 11 October, IBM Italy provided the following response to Gartner:

"The nationwide blackout of 28 September affected the IBM data center located in Vimercate, which delivers IT services to both IBM users and customers' organizations. At the time the power went out, 3:21 a.m., the UPS system (backup battery) started to work, as expected in such cases. For reasons that are currently under investigation, overheating developed, impacting also some areas of the data center. Also, the fire alarm went off, causing the intervention of the fire brigade.
Although there was no fire in the data center, the above events caused serious damage to various pieces of equipment and interruption of the delivery of certain IT services. The disaster recovery procedures started immediately. Services to IBM users were restored within 48 hours. As for the customers served in outsourcing by the Vimercate data center, disaster recovery procedures were also started in accordance with the customers' specific requirements.
Normal disaster recovery procedures provide restoration of services within a previously agreed time frame, on the basis of the data resulting from the last available backup. The backup is usually provided on a periodic basis, normally daily or weekly, in accordance with the relevant contracts in force between IBM and each customer. A number of customers who had disaster recovery services and contracted for frequent backups followed the normal procedures of disaster recovery and were able to restart operations within the agreed time frame.

However, other customers, who had disaster recovery services but did not contract for a daily backup, decided, with IBM, not to run disaster recovery services immediately on the last available backup, but to focus on restoring the most recent data, in spite of a possibly longer time of recovery. In fact, they aimed at getting their operations back to the level where they were at the moment of the national blackout (including, therefore, the data not present in their last available backup). IBM has also worked for all those customers who did not have any disaster recovery procedure included in their contracts, in order to allow them to recover their most recent data in any case.

Customers' operations started to be recovered from 29 September. IBM has been in continuous communication with the highest managerial and IT levels of its customers throughout these events. The efforts have been extraordinary for all those involved, both IBM and its customers. IBM made available and put to work its national and international resources and capabilities (experts from its software laboratories and production centers, its support centers, and its advanced techniques and tools for data restoring). More than 400 IBM staff worked in shifts, 24 hours a day, until the restart of full operations.
Furthermore, IBM installed for the benefit of its customers additional and more powerful hardware and storage (than that used before 28 September), in order to restore customers' normal operations more quickly." (Source: IBM Italy.)

How Some Clients Were Affected

On 30 September (the second working day after the event), an Italian bank announced to its clients that business operations based on central mainframe services, which it had outsourced to the Vimercate data center, were not operational. Branch processing (supported by local configuration and software) and point-of-sale operations (managed in-house) were available as normal. The bank only reported a return to full business operations on 7 October, nine calendar days after the blackout. Press reports indicate that two other banks suffered similar, if not worse, problems and described the significant difficulties experienced by customers.

It was also reported that a Nestlé production plant for Perugina chocolate, near Perugia, was closed and the workforce sent home for three days (from 30 September to 2 October) because the IBM Vimercate event made it impossible for the client to manage the plant's logistics and warehouses. These clients' business problems have been confirmed by their customers and other indirect sources.

Gartner Comment
A real disaster affected the IBM site. It had a significant business impact on some outsourcing clients for up to nine calendar days. The fault, if any, may lie with IBM, with its clients' disaster recovery practices, or somewhere in between. Our focus here is on the impact of this kind of event on the market and, more importantly, on how to avoid it. The event raises some unanswered questions. Nevertheless, the lessons learned will affect:

- Italian businesses. The event will push security, disaster recovery and business continuity higher up the corporate priority list.
- Organizations that have outsourced services, including disaster recovery and business continuity. The event demonstrates that business risk cannot be easily transferred to providers.
- IBM. A logical and advisable source of data center outsourcing services, the company is left managing an embarrassing situation.
- The outsourcing market in Italy. We expect resistance to outsourcing to rise.
- Other IT services providers. Competitors may take advantage of IBM's failure to enhance their position. Another provider has already declared, "Our clients encountered no problems from the blackout, because continuity is part of our contracts and our culture."

For Italian businesses, disaster recovery and business continuity practices have not been a high enough priority for many years. A separate and distinct market for these services never grew enough to allow the development of recovery specialists; they are usually delivered by outsourcing providers or managed in-house. These services are already hot topics in financial institutions, because Basel II regulations are adding new scrutiny and financial burdens to operational risks. Business dependence on IT service continuity is already strong in many other verticals, including manufacturing, where integration between enterprise resource planning and plants, and automation of the supply chain, make continuity a must.
Security, disaster recovery and business continuity best practices, the continuous updating of these practices and plans, and recurring tests must become embedded in day-to-day business processes. Every contingency plan must include communications that limit the damage to an organization's business and image. Businesses can't just "hope nothing will go wrong." Unfortunately, sometimes it does. Client organizations must understand their provider's contractual responsibilities and which ones rest with them. Very often, the client still owns the majority of the risk. We advise organizations to ask themselves the questions in these five areas:

Do you understand your own recovery needs? Do you understand the impact of service failure or loss of data on business operations? Do you understand how the impact increases over time? With this business information, you can judge the requirements and value of backup and recovery practices. You can also define the appropriate recovery time objective (RTO), the recovery "window," and the recovery point objective (RPO), the acceptable data loss. Without these business requirements, disaster recovery and business continuity services can look like a costly overhead to be cut.

Are requirements clearly communicated to the provider? Do the service provider and the internal IT organization recognize these requirements? Are they prepared to meet them? Does your organization accept that there is a price to be paid?
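To make the RTO/RPO trade-off concrete, here is a minimal sketch (our illustration, not drawn from any actual IBM contract; all figures are hypothetical) of how the contracted backup interval bounds worst-case data loss, and how a contracted restore window compares against the objectives:

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval_hours: float) -> timedelta:
    """Worst-case data loss (the effective RPO) occurs when a failure
    hits just before the next scheduled backup would have run."""
    return timedelta(hours=backup_interval_hours)

def meets_objectives(rto_hours: float, rpo_hours: float,
                     restore_window_hours: float,
                     backup_interval_hours: float) -> bool:
    # Recovery arrangements meet the objectives only if the contracted
    # restore window fits within the RTO and the backup interval fits
    # within the RPO.
    return (restore_window_hours <= rto_hours
            and backup_interval_hours <= rpo_hours)

# A daily backup implies up to 24 hours of lost data; a weekly backup,
# up to 168 hours (hypothetical figures echoing the daily-vs-weekly
# contracts mentioned in IBM's statement).
print(worst_case_data_loss(24))    # 1 day, 0:00:00
print(worst_case_data_loss(168))   # 7 days, 0:00:00

# A 48-hour restore window with weekly backups fails a 24-hour RPO:
print(meets_objectives(rto_hours=48, rpo_hours=24,
                       restore_window_hours=48,
                       backup_interval_hours=168))  # False
```

The point of the sketch: without business-derived RTO and RPO figures, there is nothing to test a contract's backup schedule and restore window against.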
Does your contract reflect current requirements for backup and recovery services? Are the contractual recovery plans based on your current business environment? During negotiations of service contracts, these kinds of services easily drop off the "radar screen," especially when the price is cut down through multiple rounds of negotiation.

Does your change management process capture and communicate changing requirements? RTO, RPO and critical loads change over time. The test of real business continuity capability lies in the ability to capture and act on those changes.

Do you test and audit recovery capability and requirements? Audits are needed to ensure the appropriate controls and practices are up and running, and to provide a level of assurance that requirements and deliverables are aligned. An untested recovery plan is at best a set of good intentions. Providers and clients must work together to conduct regular tests of these capabilities.

In summary, adopting best practices for disaster recovery and business continuity, enforcing contractual audit rights on the provider and carefully co-managing recovery tests are not exotic requests. They are healthy practices from both a business and a personal perspective.

For IBM, many questions about its management of this event remain unanswered:

Why did a nonevent precipitate a disaster? In the Milan area, the blackout lasted three hours, a nonevent for a data center. Italy was in a declared state of "controlled blackout" for most of the summer, so every data center and plant should have tested its power supplies and contingency plans. IBM's other Milan data center suffered an accidental fire a few years ago, and U.S. companies are a potential target of terrorist activity in Europe. Plant maintenance, recovery plans and tests, and best practices in problem containment and business continuity should already have been enforced to the letter.

Were damages minimized by the first reaction?
Although IBM reports there was no fire, the intervention of the fire brigade added delay and perhaps uncertainty. If there was no fire, what damaged the equipment? Was the fire brigade trained for an emergency on this site? Best practices for managing data centers aim to limit initial damage.

Were contingency and recovery plans implemented quickly and efficiently? Gartner has talked with clients affected by the outage. Their view of the timing of events doesn't fit entirely with information from IBM, possibly the result of miscommunication or misunderstanding. As confirmed by IBM, a few clients requested an unplanned recovery path (using original data from the damaged center) instead of the more usual plan (48 hours, using the last data backup). Was a realistic evaluation made of the time needed and the risk to customers? Was a quicker and sounder path to recovery expected? Were regular recovery tests done with these clients? Did IBM consultants advise clients of the risks? Was the risk to clients clearly described in their contracts? While it's clear that IBM's efforts after the event were remarkably strong, continuity of service and disaster recovery best practices are about anticipating risk, avoiding and containing damage, and executing contractual recovery plans that have been tested.

Can a leading provider give clients complete freedom on security and recovery issues? Even if all blame lies with clients, this event harmed IBM's image in key areas: mainframes, data centers, disaster recovery and business continuity services. Service providers and their clients are obviously free to define their service relationship when drawing up a contract. Nevertheless, shouldn't wise outsourcing providers and security consultants prevent their clients from operating below a certain security level, to avoid being damaged if something goes wrong?
Did the communication plan work well? Every continuity expert teaches that communications (internal, to clients and to external parties) are of paramount importance in an emergency. Sometimes, organizations affected by a problem elect not to talk about it, perhaps in the hope of limiting the spread of bad news. This approach does not work, because it fuels speculation and can damage a company's image unnecessarily.

Bottom Line: The events at Vimercate re-emphasize the need to build security and resilience into business infrastructure, whether it is delivered internally or outsourced. We advise every business to evaluate the potential direct and indirect damage in the event of a disaster, and to check the efficiency of its contingency plans. The events have a wider implication for the image of IBM in Europe, because a provider's disaster recovery and business continuity practices are usually strictly aligned, at least at a regional level. Clients of any outsourcing provider who are not clear about their service-level guarantees and procedures in the event of a major incident should talk to their account representatives as soon as possible. IBM must work hard to assure clients and prospects of the continuity of its services, while significantly improving its communication plans, if it is to avoid significant damage to its image in the region.