Data center outages: impact, causes, costs, and how to mitigate

Data centers sometimes fail. You can build in safeguards, fail-safe mechanisms, and redundancy through backup systems, but like all engineered systems, data centers can -- and sometimes do -- fail. Table 1 lists some of the notable data center outages of 2011 and 2012 and shows how even the biggest brands, with access to the best technology and resources, can suffer from data center outages.

TABLE 1: Notable Data Center Outages in 2011 and 2012

| Who | How long | What happened | Impact |
|---|---|---|---|
| Huffington Post, Buzzfeed, Gawker and several others | Few days | Water flooded data centers in New York after Hurricane Sandy | Several websites and other services down |
| Twitter | Few hours | Both primary and backup systems failed | A well-publicized campaign to encourage athletes and visitors to the Olympics to tweet was affected |
| Salesforce | 7 hours | Power failure in data center | CRM services to customers affected |
| Bank of America | 6 days | Online banking down across U.S. | 29 million users affected |
| Amazon Web Services | 4 days | Amazon EC2 (Elastic Compute Cloud) services went down | |
| Intuit | 2-4 days | Customers lost access to applications such as TurboTax Online, QuickBooks Online, Quicken and QuickBase | Several thousands |
| Google | 2 days | Gmail affected | 120,000 users affected |
| Blackberry | 24 hours plus | Unavailable worldwide | Millions of users affected |
| Yahoo | 24 hours plus | Yahoo Mail outage | |
| Microsoft | 24-72 hours | Windows Live, Hotmail inboxes disappear | |
| Verizon | 24 hours plus | Series of data outages | Several US states unable to get LTE service |
| Netflix | 4-8 hours | Netflix streaming service affected | 20 million users affected |

Source: See Ref 1, Ref 2

So how can businesses ensure that disruptions due to data center glitches are minimized?

First, some perspective. Using an outsourced data center is, in almost all cases, far more reliable and cost-effective for a company than building one in-house. That's because a third-party data center is able to share the very high cost of the technology, infrastructure, and personnel that go into building the data center among multiple customers. In fact, the economies of scale are so compelling that while data centers are growing in size, they are declining in number (see Ref 3), which means that more companies are outsourcing more of their IT infrastructure to third-party data centers.

Second, it helps to know what makes up a data center in order to better understand what is involved in keeping it robust.

What is inside a data center?

A data center is a configuration of server rooms, cooling units, storage, batteries, and generators. At the core of a data center are racks and racks of servers. Servers need power, lots of it -- a typical large data center occupies 50,000 square feet of space and consumes 5 MW of power. Consuming so much power generates massive amounts of heat. This heat is carried away by cooling units that force cool air from the floor, through the racks, and into ducts above.

Data centers collect and store vast amounts of data. This data needs to be stored safely, often for several years (as in the case of financial information). The storage hardware is therefore kept in secure locations -- for example, in underground mines.

Since data centers run on power and utility power can fail, every data center has batteries for backup -- thousands of them, stacked up and constantly being charged. In the event of a power failure, these battery banks provide power. But batteries can provide power only for a few minutes at most. To provide power during longer power failures and blackouts, most data centers have banks of diesel generators on standby.
And since these massive diesel generators need fuel, data centers need to store thousands of liters of diesel fuel.

Causes and cost of data center outage

Information on data centers is hard to come by. Because data centers are critical pieces of IT infrastructure and store sensitive customer data, data center managers are fiercely protective of their privacy. Probably the only major surveys of data center outages and their associated costs are two studies by the Michigan-based Ponemon Institute, sponsored by Emerson Network Power. Both studies are limited to U.S. data centers but can be considered representative of the industry.

Outage causes

The first study, National Survey on Data Center Outages, published in September 2010, surveyed 453 individuals responsible for data center operations in the U.S. Of these, 95% said they had an unplanned data center outage in the last two years. Each respondent averaged 2.48 complete shutdowns with an average downtime of 107 minutes.

Apart from complete shutdowns, respondents reported far more frequent partial rack- or row-based outages -- an average of 6.8 row-based outages with an average downtime of 152 minutes, and an average of 11.2 rack-based outages with an average duration of 153 minutes in a two-year period.

The most frequently cited root causes of data center outage were: UPS battery failure (65%), UPS capacity exceeded (53%), human error (51%), and UPS equipment failure (49%). The most common responses to unplanned outages were to repair, replace, or purchase additional IT or infrastructure equipment, followed by contacting the equipment vendor for support.

Datacenter outages: the Indian context

In the 2011 Data Center Risk Index published by hurleypalmerflatt, an engineering consultancy, and Cushman & Wakefield, a real estate consultancy, India ranked at the bottom of the 20 countries assessed on the risk associated with running a data center. The U.S., Canada, and Germany were at the top of the rankings. On the face of it, this is a dismal ranking for a country that is at the center of the global outsourcing revolution.

On closer look though, things are not as bad as they seem. To begin with, the Data Center Risk Index is a weighted average of 11 macro and local factors covering a wide range of attributes, from the cost of energy to political instability to inflation to availability of water. Depending on their priorities and approaches to risk, individual customers will arrive at significantly different assessments of risk.

This was best highlighted during the world's largest power blackout, when an estimated 600 million people in the northern half of India lost power for two days in July 2012. In spite of the massive disruption across several areas of the economy, from public transport to industry to hospitals, there were no reports of major disruptions in data centers anywhere in India (see Ref 4). One ostensible reason is that the bulk of the data centers are located in Mumbai and the south of India, while the blackout was in the northern half of India. But the real reason was that India has a chronic power problem and data centers are geared to work through intermittent, low, and no power from public utilities. Most third-party data centers have power backup for days on end -- it's just another risk to be managed.

TABLE 2: Data Center Resilience Tier Levels

Tier 1: Basic (99.671% availability)
- Susceptible to disruptions from both planned and unplanned activity
- Single path for power and cooling distribution, no redundant components (N)
- May or may not have a raised floor, UPS, or generator
- Takes 3 months to implement
- Annual downtime of 28.8 hours
- Must be shut down completely to perform preventive maintenance

Tier 2: Redundant Components (99.741% availability)
- Less susceptible to disruptions from both planned and unplanned activity
- Single path for power and cooling distribution, includes redundant components (N+1)
- Includes raised floor, UPS, and generator
- Takes 3 to 6 months to implement
- Annual downtime of 22.0 hours
- Maintenance of power path and other parts of the infrastructure requires a processing shutdown

Tier 3: Concurrently Maintainable (99.982% availability)
- Enables planned activity without disrupting computer hardware operation, but unplanned events will still cause disruption
- Multiple power and cooling distribution paths, but with only one active path, includes redundant components (N+1)
- Includes raised floor and sufficient capacity and distribution to carry load on one path while performing maintenance on the other
- Takes 15 to 20 months to implement
- Annual downtime of 1.6 hours

Tier 4: Fault Tolerant (99.995% availability)
- Planned activity does not disrupt critical load, and data center can sustain at least one worst-case unplanned event with no critical load impact
- Multiple active power and cooling distribution paths, includes redundant components (2 (N+1), i.e., 2 UPS each with (N+1) redundancy)
- Takes 15 to 20 months to implement
- Annual downtime of 0.4 hours

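The annual downtime figures in Table 2 follow from the availability percentages. The short Python sketch below shows the arithmetic, assuming a non-leap year of 8,760 hours; the function and variable names are illustrative and do not come from the TIA-942 standard. Under this assumption the calculation reproduces the Tier 1, 3, and 4 figures after rounding, while Tier 2 comes out at roughly 22.7 hours rather than the 22.0 hours quoted in the table.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)

# Availability percentages as listed in Table 2.
TIER_AVAILABILITY = {
    "Tier 1": 99.671,
    "Tier 2": 99.741,
    "Tier 3": 99.982,
    "Tier 4": 99.995,
}


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected hours of downtime per year for an availability in percent."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, percent in TIER_AVAILABILITY.items():
        hours = annual_downtime_hours(percent)
        print(f"{tier}: {percent}% available -> {hours:.1f} hours of downtime per year")
```

The same function can be used to translate any availability figure a vendor quotes into hours per year, which is usually easier to compare against business tolerance for downtime.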
Outage costs

The second Ponemon Institute study, Calculating the Cost of Data Center Outages, published in February 2011, surveyed 41 independent data centers in the U.S. that experienced at least one complete or partial unplanned shutdown in the previous 12 months. The survey revealed that data center outages have significant financial consequences, ranging from a minimum cost of $38,969 to a maximum of $1,017,746 per organization. The average cost of a data center outage was $505,502 per incident ($1 = 55 INR).

How to evaluate data center reliability

Historically, data centers have been designed in the absence of established standards. This made it very difficult for network managers to choose technologies to build and benchmark data centers. In 2005, the Telecommunications Industry Association (TIA) published TIA-942, the first standard to specifically address data center infrastructure. The TIA-942 standard covers site space and layout, cabling infrastructure, tiered reliability, and environmental considerations. Of these, the tiered reliability standards are directly useful to organizations looking to evaluate data center resilience across vendors.

The TIA standards, based on a system pioneered by the New York-based Uptime Institute in the mid-nineties, prescribe architectural, security, electrical, mechanical, and telecommunications recommendations. There are four tiers of availability, from Tier 1 to Tier 4, with Tier 4 being the most resilient. See Table 2 for a description of the tiers -- redundancy is indicated in terms of N, where N represents only the necessary system need. Going up the levels has a significant cost impact -- construction costs for Tier 3, for instance, are double those for Tier 1. So organizations need to carefully determine an appropriate tier level for their different needs.

eBay, for example, started out with all their applications in a Tier 4 data center until they analyzed their needs more closely and determined that 80% of their equipment could be shifted out without loss of reliability -- search, for instance, could be in a Tier 2 center, whereas databases and network backbones needed to be in a Tier 4 center. eBay says they cut their data center Capex and Opex by half by matching applications to data center tier level (see Ref 5).

How to mitigate data center outages

Experts recommend the following to minimize data center outages and mitigate damage:

- Invest in better equipment. It's tempting to save money by buying cheap, but the cost of hardware failure is very high.
- Provide redundancy -- relying on any single machine or a single component in the core architecture is disastrous.
- When it comes to crucial data, never assume that someone else is automatically protecting you.
- Have backups. Have your data available on multiple servers in multiple data centers. Even consider having them in different geographical regions and spread between different service providers.

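To see why spreading data across multiple data centers helps, the following Python sketch estimates the chance that at least one copy is reachable. It rests on a strong assumption that is not made in the studies cited here: that sites fail independently of one another. Sites that share a region, a provider, or a software stack can fail together, which is one reason for the advice to use different regions and providers. The function name is hypothetical, and the example input is the Tier 1 availability from Table 2.

```python
def combined_availability(site_availability: float, copies: int) -> float:
    """Probability that at least one of `copies` sites is up.

    Assumes every site has the same availability (as a fraction between
    0 and 1) and that site failures are statistically independent.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies are down at once with probability (1 - a) ** copies.
    return 1.0 - (1.0 - site_availability) ** copies


if __name__ == "__main__":
    tier1 = 0.99671  # Tier 1 availability from Table 2, as a fraction
    for copies in (1, 2, 3):
        print(f"{copies} site(s): {combined_availability(tier1, copies):.7f}")
```

Under the independence assumption, even two modest sites together come out ahead of a single high-tier site on paper. Real-world gains are smaller because failures are often correlated, so this is best read as an upper bound.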
Conclusion

Data center outages are real and they can cause significant loss of revenue. The frequency and duration of data center outages vary with the size of the data center: outages become less frequent and shorter in duration as data centers increase in size. IT equipment failure is the most expensive root cause and human error is the least expensive. But the benefits of outsourcing IT infrastructure to a third-party data center far outweigh the risks. As with all engineered systems, the risk is quantifiable and manageable.

References:

1. Major data center outages in 2011: http://www.evolven.com/blog/2011-devastating-outages-majorbrands.html
2. Salesforce outage: http://www.informationweek.com/cloud-computing/software/salesforce-outage-followsdata-center-po/240003577
3. U.S. Datacenters Growing in Size But Declining in Numbers, IDC press release, 9 Oct 2012
4. India's Blackout, DataCenter Dynamics, Penny Jones, 31 July 2012, http://www.datacenterdynamics.com/blogs/penny-jones/india%e2%80%99s-blackout
5. Matching applications to data center tier level: http://blog.uptimeinstitute.com/2011/07/matchingapplications-to-data-center-tier-level/

www.netmagicsolutions.com

The content you have downloaded has been produced with thoughtful, original research efforts by Netmagic. Please do not duplicate or misuse it. You may quote portions of our research in your own material provided you include a proper attribution to this original source. You are free to share this content on the web with friends and colleagues. 2013. All rights reserved.