June 29, 2009 Incident Review Dallas Fort Worth Data Center Review Dated: July 8, 2009




The purpose of this document is to capture the events and subsequent response to the incident that took place in the DFW datacenter on 29 June, 2009.

I. Executive Summary

On 29 June, an area of the Rackspace DFW Datacenter experienced a power outage that impacted customer equipment between approximately 3:15 PM CDT and 4:15 PM CDT. Several components of the datacenter infrastructure were involved in the outage, including the generators, the uninterruptible power supply (UPS) and several pieces of the electrical infrastructure. Immediately following the restoration of power to the impacted area, Rackspace engaged equipment vendors, third-party advisors and its data center engineering team to conduct a root cause analysis of the events that precipitated this incident. Actions have been taken to remediate several issues.

The primary issues involved were:
- Failure of the four generators in generator bank A to properly synchronize with the UPS systems; ultimately the generators failed to handle the electrical load.
- Failure in portions of the electrical infrastructure, which prevented the transfer of the electrical load between different power sources.

Rackspace believes it has identified the root causes behind the failures. They are summarized here:
- Quality of power among the generators and from the generators to the UPS battery cluster caused the synchronization failure of the generators.
- Switches in the electrical infrastructure (i.e. Pad Mounted Switch) failed to perform to specification.

Corrective actions have been taken on the generators, UPS and primary utility feeder breaker. There will be ongoing and preventative maintenance and remediation activities, as well as a number of process and leadership changes which will further strengthen the quality of the data center infrastructure and operations.

II. Incident Timeline, 29 June

1:51 p.m. CDT
The DFW datacenter lost utility power to its chiller plant and UPS clusters A, B, C, D, E and F in DFW Phases I and II. All back-up systems performed as designed and the electrical load was supported by UPS batteries while generators supporting the chiller plant and the UPS clusters came online as designed.

2:35 p.m. CDT
The A bank of generators that supports UPS clusters A, B and E began experiencing a synchronization failure. This condition escalated and subsequently prevented the generators from holding the electrical load for UPS clusters A, B and E. The Facility Engineering team could not resolve the generator bank A issues and attempted to engage the secondary utility feeder. This attempt failed due to an issue in the Pad Mounted Switch (PMS), which allows the transfer to the secondary utility feeder. When the A bank generators failed, the UPS batteries came back on line automatically to support the electrical load for UPS clusters A, B and E until approximately 3:15 PM. The B bank of generators and the chiller plant generators continued to perform as designed without customer interruption.

During this time, Facility Engineering had begun the process of by-passing UPS clusters A and B, as they believed generator bank A would support clusters A and B without experiencing synchronization issues if cluster E was left off line. This process involved routing power across a maintenance bus. The maintenance bus isn't used for utility power. While the team was working on the maintenance bus transfer, utility power was returned to the primary utility feeder.

3:15 p.m. CDT
The batteries in UPS clusters A, B and E completely discharged and all equipment receiving power through those UPS clusters experienced an interruption in service. This occurred because we could not switch over to the secondary utility feeder and because we could not stabilize the generators supporting UPS clusters A, B or E. All other parts of the data center continued to operate without impact, including the chiller plant and UPS clusters C, D and F. Phase III of the DFW data center was not impacted.

3:58 p.m. CDT
The primary utility feeder was restored and UPS cluster E came back up in by-pass automatically. At this point, because the load required from the A bank generators no longer included cluster E, the Facility Engineering team was able to successfully bring UPS clusters A and B back up, on generator power, without battery back-up.

5:37 p.m. CDT
Batteries for UPS clusters A, B and E were recharged and cluster B was moved back to primary utility power. At this time, cluster A required a repair to one of the four UPS modules (module 2), so the cluster remained on generator power. Only 3 modules are required for UPS cluster A to perform.

7:40 p.m. to 7:45 p.m. CDT
The A bank generators, now supporting only UPS cluster A, began to experience synchronization issues, forcing a transition back to utility before the repairs on UPS cluster A, module 2 were completed.

8:55 p.m. to 9:00 p.m. CDT
Repairs to module 2, in UPS cluster A, were completed and tested and the module was brought back on-line to restore N+1 redundancy in UPS cluster A.
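The gaps between the timeline entries above can be tabulated with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline, while the event labels and the helper names `minutes_between` and `intervals` are assumptions introduced here. Note that the 40-minute window from 2:35 p.m. to 3:15 p.m. is the time between the first synchronization failure and battery exhaustion, not a measured battery runtime, since the review does not state exactly when the generators dropped the load.

```python
from datetime import datetime

# Timestamps (CDT, 29 June 2009) copied from the timeline; labels are paraphrased.
EVENTS = [
    ("13:51", "Utility power lost; UPS batteries, then generators, carry the load"),
    ("14:35", "Generator bank A begins to lose synchronization"),
    ("15:15", "Batteries in UPS clusters A, B and E fully discharged"),
    ("15:58", "Primary utility feeder restored; cluster E back in by-pass"),
    ("17:37", "Batteries recharged; cluster B back on utility power"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def intervals(events):
    """Return (start, end, minutes) for each pair of consecutive events."""
    return [
        (a[0], b[0], minutes_between(a[0], b[0]))
        for a, b in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for start, end, mins in intervals(EVENTS):
        print(f"{start} -> {end}: {mins} min")
```

The 40-minute window sits at the upper end of the 20-40 minute UPS target described in Section IV.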

III. Diagnosis and Remediation

Rackspace has undertaken a number of repairs and proactive maintenances in response to this incident. This is a summary of completed activities and future maintenance plans:

Generator and UPS Systems: Testing and Findings

The disruptions experienced on 29 June were the result of the 4 generators in bank A failing to synchronize and going into an unacceptable excitation state, which led to the A bank generators failing to carry the electrical load. Rackspace has narrowed the root cause of the excitation failure to power quality. When power quality is poor, the voltage and current are not aligned, causing stress on the generators and ultimately the inability to synchronize properly. Rackspace conducted a number of maintenances and tests on the generators and UPS systems to address the power quality issues. The following are the key maintenance activities Rackspace conducted:

1. During off-peak hours on 30 June through 1 July, Rackspace and its vendors placed power meters on the input side of all UPS cluster systems for generator banks A and B, in order to collect and compare detailed power information. Data was collected for both utility and generator settings on the UPS clusters to identify any discrepancies.

2. Incorrect settings on the UPS input filters were found to be a contributing factor to the power quality and synchronization issues between the UPS systems and the generators. The DFW generator bank A UPS units were previously programmed to drop out all input filters any time the generators are in use, causing additional stress on the generators. These settings were found to be inconsistent with the settings for generator bank B's UPS units. Measurement of the power quality of the UPS units confirmed that there was a high amount of distortion in the UPS power input for generator bank A, which supports UPS clusters A, B and E.

3. During off-peak hours on 2 July, Rackspace's vendor load tested generator bank A supporting UPS clusters A, B and E to validate generator performance under simulated load conditions. During the maintenance, Rackspace was able to take detailed readings on the control parameters and load sharing settings. The vendor discovered an unusually high amount of noise on the electrical lines that send signals to the four generators in bank A, instructing them how to share the electrical load. Upon investigation, it was discovered that firmware among the three generators was inconsistent, due to a maintenance failure. Additionally, the wrong kind of wire had been used to connect the generators and the wire was also incorrectly grounded.

The resulting failure of the A bank generators was the result of the accumulation of the factors described above, exacerbated by the recent and significant use of the A generator bank.

Generator and UPS Systems: Remediation Actions

1. The logic controlling the input filters for the UPS systems was modified so that filters will be applied or removed as the load on the UPS increases or decreases, respectively.
2. The controls for generator bank A were re-wired and grounded.
3. Firmware was updated for all generators in bank A as per the manufacturer's specification.
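The change described in remediation action 1 can be pictured as a small decision function. The following is a hedged sketch only: the review does not give the actual set points or control logic, so the thresholds `ENGAGE_ABOVE` and `RELEASE_BELOW`, the hold band between them, and both function names are hypothetical illustrations rather than the vendor's implementation.

```python
# Hypothetical set points: the review does not state the real values.
ENGAGE_ABOVE = 0.30   # fraction of rated UPS load at or above which filters are applied
RELEASE_BELOW = 0.20  # fraction of rated UPS load at or below which filters are removed


def legacy_filter_state(on_generator: bool) -> bool:
    """Previous bank A behavior per the review: drop all input filters on generator."""
    return not on_generator


def next_filter_state(filters_on: bool, load_fraction: float) -> bool:
    """Load-based behavior: apply filters as load rises, remove them as it falls.

    Between the two thresholds the current state is held (an assumed
    hysteresis band) so the filters do not toggle on small load changes.
    """
    if not 0.0 <= load_fraction <= 1.0:
        raise ValueError("load_fraction must be between 0 and 1")
    if load_fraction >= ENGAGE_ABOVE:
        return True
    if load_fraction <= RELEASE_BELOW:
        return False
    return filters_on


if __name__ == "__main__":
    state = False
    for load in (0.10, 0.25, 0.35, 0.25, 0.15):
        state = next_filter_state(state, load)
        print(f"load={load:.2f} filters_on={state}")
```

The point of the contrast is that the filter decision now depends on UPS load rather than on whether the generators happen to be the power source.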

Following these changes, testing on 3 July showed that all of the above issues had been eliminated and that the generators and UPS systems were functioning properly. Due to the successful outcome of the maintenance on generator bank A, Rackspace proactively updated generator bank B.

Pad Mounted Switch

The function of the Pad Mounted Switch is to direct power from the utility feeder to each transformer which leads to the Datacenter. The #8 Pad Mounted Switch failed when the datacenter team attempted to move from the failing generator bank to the secondary utility feeder. During the investigation it was discovered that the mechanical linkage in the Pad Mounted Switch would not engage despite the fact that the switch had been maintained in accordance with manufacturing guidelines.

Remediation Plan: Parts are on order and are expected to be available during the week of 6 July. Repairing the switch will require that Rackspace move UPS cluster F, supported by generator bank B, to generator power during a scheduled maintenance window. This repair has tentatively been scheduled during the latter half of the week beginning 6 July.

Utility Breaker Maintenance

During the event investigation, the Facility Engineering team discovered that the utility feeder breakers had not been adjusted to properly reflect the increased power requirements of the facility, which had recently been increased to 12MW.

Remediation Actions: During off-peak hours on 3 July, Rackspace's utility provider adjusted the settings of the feeder breaker for both primary and secondary feeders to better match the power requirements of the facility.

Systemic Operating Changes

Throughout the course of recent events, Rackspace has taken steps to remediate the current issues and mitigate future issues by leveraging third party experts and vendors, making organizational changes and evaluating the efficacy of processes used in data center operations. That work will continue over the coming months; however, the following systemic changes have been made:

Resources Leveraged in Diagnosis and Remediation Activities:
- Eaton Electrical: Electrical Infrastructure Vendor
- Cummins: Generator Vendor
- Oncor: Utility Provider
- Eaton Powerware: UPS Battery Cluster Vendor
- Consultant from M.I.T
- Consultant from a tier-one datacenter operator
- Rackspace Facility Engineers from DFW and other Rackspace data center facilities

Operating Changes Going Forward (Completed or In Progress):

Organization:
- Changed DFW data center and US data center leadership

Process:
- Reviewing all internal maintenance records and processes to ensure Rackspace meets and/or exceeds the manufacturer's recommended maintenance plan for infrastructure

System:
- Changing our internal change management policies to ensure multiple parties are involved in the design, review and acceptance of future maintenance activities
- Engaging the equipment manufacturers and service providers to inventory previous maintenance activities and provide forward-looking maintenance plans
- Improving the communication process to Customers regarding standard datacenter maintenance events and incidents

As referenced in the above document, Rackspace, in addition to the immediate incident response, has proactively replaced several components of the datacenter infrastructure (i.e., UPS battery clusters).

IV. Dallas/Ft. Worth Data Center Power Structure Overview

The data center is composed of several key pieces of equipment and associated electrical infrastructure. Major components are described below:
- DFW Phases I and II have two separate utility feeds (primary and secondary). The utility feeds are on separate sub-stations. Shifting between the two utility feeds is accomplished by the Pad Mounted Switches.
- DFW Phases I and II are supported by two separate banks of generators that power the UPS clusters. Changing between utility feed and generator feed is accomplished by the Automatic Transfer Switch (ATS).
- Generator bank A and the primary utility feed deliver power to an Uninterruptible Power Supply (UPS) with three UPS clusters (A, B and E) via a power bus.
- Generator bank B and the primary utility feed deliver power to an Uninterruptible Power Supply (UPS) with three UPS clusters (C, D and F) via a power bus.
- The Uninterruptible Power Supply (UPS) clusters are batteries that maintain power during the shift between utility and generator power. The UPS clusters target 20-40 minutes of power in the absence of utility or generator power.
  - Each UPS cluster has four modules
  - Each module has four strings of batteries
  - Each string has forty batteries
- Each UPS cluster has between ten and fourteen Power Distribution Units (PDUs). Each PDU supports approximately forty cabinets, which include servers and other devices. Cabinets have redundant connections to separate PDUs.

This data center infrastructure has been built with redundancy to allow for standard operation in the event of failed components.
- Rackspace can lose one utility feed and still be powered by generators or the other utility feeder (N+1)
- Rackspace can lose one generator in the generator bank and still power all three UPS clusters (N+1)
- Rackspace can lose one module in any individual UPS cluster and still power the attached PDUs and associated cabinets (N+1)
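
The per-cluster figures in the overview above can be multiplied out, and the module-level N+1 statement checked, with a small model. This is a sketch under stated assumptions: the class and function names are hypothetical, and the "3 of 4 modules required" figure is taken from the timeline entry for cluster A and assumed here to hold for every cluster. Because each cabinet connects to more than one PDU, the PDU figure counts cabinet connections, so the number of distinct cabinets per cluster may be lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpsCluster:
    """Battery layout of one UPS cluster, using the counts given in Section IV."""
    modules: int = 4
    strings_per_module: int = 4
    batteries_per_string: int = 40
    # Stated for cluster A in the timeline; assumed here for all clusters.
    modules_required: int = 3

    def total_batteries(self) -> int:
        return self.modules * self.strings_per_module * self.batteries_per_string

    def survives_module_loss(self, failed_modules: int) -> bool:
        """True if the remaining modules can still carry the cluster's load."""
        return self.modules - failed_modules >= self.modules_required


def pdu_cabinet_connections(min_pdus: int = 10, max_pdus: int = 14,
                            cabinets_per_pdu: int = 40) -> tuple:
    """Approximate range of cabinet connections served by one UPS cluster."""
    return (min_pdus * cabinets_per_pdu, max_pdus * cabinets_per_pdu)


if __name__ == "__main__":
    cluster = UpsCluster()
    print("Batteries per cluster:", cluster.total_batteries())
    print("Survives one failed module:", cluster.survives_module_loss(1))
    print("Survives two failed modules:", cluster.survives_module_loss(2))
    print("Cabinet connections per cluster:", pdu_cabinet_connections())
```

Under these assumptions a cluster holds 640 batteries and tolerates the loss of one module but not two, which matches the N+1 description.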