Incident Report Sunday, July 26, 2015. Incident #2015-88




Summary

Services located in the main Queen's Datacentre (Email/Calendar, PeopleSoft, Moodle, Wiki, QShare, internet connectivity, hosting services, web services, storage, authentication services and application services) became unavailable on Sunday, July 26, 2015 at 11:30 am, when the fire detection sensors in the main Queen's Datacentre, located in Dupuis Hall, activated the fire suppression system. As designed, this process cuts all power to the Datacentre.

Kingston Fire Department, Queen's emergency services and IT Services (ITS) staff reported on-site immediately. The Kingston Fire Department inspected the room and concluded that there was no active fire, but that ventilation was required to restore safe oxygen levels in the room. As prescribed, ventilation began, and at 2:50 pm Environmental Health and Safety confirmed that the oxygen levels in the room were safe for staff to enter. Upon entry, physical inspection and cleanup began, and power was restored to the room at approximately 3:30 pm.

The situation was assessed and it was determined that restoring services in the Dupuis Datacentre would take less time than moving the failing systems over to our disaster recovery site at PCVL. Following the service restoration plans in place for datacentre services, the main Queen's website was restored by approximately 3:45 pm while other systems were being restored. A hard shutdown of this nature often results in physical hardware failures (power supplies, disk drives, etc.) as well as corruption of systems, particularly for aging systems. Staff began troubleshooting and restoring equipment, with services starting to come on-line at approximately 5:00 pm. All systems (production and development) were restored by Monday July 29, 2015 at 10:00 am.

Facilities work also continued throughout the evening to provide cooling and power to the datacentre. At the time of entry, only one of the three air conditioning units in the datacentre was operable. The two UPS systems also required attention.

In most system outages, the ITS Notification Tool is used to disseminate information; however, during the first response process this tool was unavailable. Communications were made available to campus users through the ITS Twitter account, @ITQueensU, until the ITS Notification Tool was restored.

Although the incident itself was unfortunate, both the shutdown and recovery procedures were performed as planned. This incident does highlight the need for infrastructure renewal in the datacentre by replacing aging equipment.

Impact

All of the Dupuis Hall Datacentre services were unavailable from 11:30 am until 1:30 pm. At approximately 1:30 pm, Internet connectivity was restored to campus by re-routing web traffic. The main Queen's website was available at approximately 3:30 pm, and the majority of services came on-line between 5:00 pm and midnight on July 26, 2015, following the ITS restoration process.

The two priorities identified for campus were SOLUS, as student registration was opening for all students at 12:00 am on July 27, 2015, and Moodle, required for the end of summer term on July 27, 2015. In addition to restoring SOLUS, extra systems capacity was required to handle the open registration volume. The capacity was provisioned and available by 12:00 am. Although the service was slow for the first 30 minutes, it stabilized by 12:30 am. Moodle was available to users at 6:30 am on July 27, in time for scheduled exams. On-premise email was available at 8:30 am on July 27, and the remainder of the application services, most of them in the development environment, followed by Monday afternoon.
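The restoration times quoted in this section can be turned into per-service downtime figures. The following is a minimal sketch, assuming the outage start of 11:30 am on July 26 and the availability times stated above; the service labels and the dictionary layout are illustrative choices, not an official ITS record.

```python
from datetime import datetime

# Outage start as stated in the report: Sunday, July 26, 2015, 11:30 am.
OUTAGE_START = datetime(2015, 7, 26, 11, 30)

# Availability times as quoted in the Impact section. The labels are
# illustrative; the report does not define a formal service list here.
RESTORED = {
    "Internet connectivity": datetime(2015, 7, 26, 13, 30),
    "Main Queen's website": datetime(2015, 7, 26, 15, 30),
    "SOLUS": datetime(2015, 7, 27, 0, 0),
    "Moodle": datetime(2015, 7, 27, 6, 30),
    "On-premise email": datetime(2015, 7, 27, 8, 30),
}


def downtime_hours(restored_at, start=OUTAGE_START):
    """Return the elapsed hours between the outage start and restoration."""
    return (restored_at - start).total_seconds() / 3600


def downtime_report(restored=RESTORED):
    """Map each service label to its downtime in hours."""
    return {service: downtime_hours(at) for service, at in restored.items()}


if __name__ == "__main__":
    for service, hours in downtime_report().items():
        print(f"{service}: {hours:.1f} h")
```

Under these assumptions, Internet connectivity was down for about 2 hours and on-premise email for about 21 hours.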

The estimated incident cost for the University is approximately $65,000, in addition to over $5,000 in time costs for ITS staff to close the incident the following week.

Root Cause

A pinhole leak in a factory weld in the second compressor of an air conditioning unit caused mineral oil to leak under the datacentre floors. The subsequent reaction of the oil created condensation that the fire detectors interpreted as smoke in the area. Standard operating procedure when the fire system alarms go off is to automatically shut down power in that room in order to prevent further potential damage.

Resolution

The ITS team worked diligently late into the night in order to restore the facilities, servers and services to the campus community for Monday morning. Course registration was available for students by midnight, and by 6:30 am on Monday all systems were running except for on-premise email. Delays were experienced in restoring services due to some hardware failures and file corruption, which is normal for a hard shutdown of systems. All services were up and running by Monday afternoon.

The incident review process creates a list of work to be done in the areas of Communications, Systems and Facilities, all as a result of the incident. The intent is to ensure processes and procedures are effective and efficient. More strategically, this incident highlights the need for an investment in infrastructure renewal. Replacing aging equipment is the proactive way to limit this type of incident and, should an incident occur, to speed up the recovery time.

Communications (Internal)

- Twitter was used (@ITQueensU) through the cell network
- Direct telephone contact to various stakeholders
- ITS Escalation
- ITS Notification once the system was restored

Communications (External)

- Twitter was used (@ITQueensU) through the cell network
- Infrastructure Operations posted the outages to the notification section of the ITS webpage, which also sent out an email to ITSNOTICE-L

Lessons Learned

There is a need to increase investment in the facility equipment lifecycle in the datacentre. The ITS operating budget is stretched at the expense of proper maintenance and renewal of the University's infrastructure systems. The air conditioning (AC) unit responsible for the outage is 14 years old and has begun to show signs of age. The lifecycle for these types of units ranges from 15 to 20 years. At the time of the incident, the AC unit had only one of its two compressors functioning. This did not contribute to the incident, but it does highlight the need to replace aging equipment. Note: there are two other AC units in the datacentre; one is twelve years old and the other is nine years old.

Incident management processes and notification/communication for this type of event should be reviewed and enhanced.

Although Queen's email resides in the Microsoft cloud (O365), authentication services are provided through the Dupuis Hall Datacentre. This design requires review.
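The AC unit ages and the 15 to 20 year lifecycle quoted above give a rough replacement horizon for each unit. The sketch below works through that arithmetic; the unit labels and the function name are illustrative assumptions, and the calculation ignores unit condition, which the report notes is already degraded for the oldest unit.

```python
# Lifecycle range for these AC units, in years, as stated in the report.
LIFECYCLE_YEARS = (15, 20)

# Unit ages from the report; the labels are illustrative, not asset names.
AC_UNIT_AGES = {
    "AC unit involved in the outage": 14,
    "second AC unit": 12,
    "third AC unit": 9,
}


def years_remaining(age, lifecycle=LIFECYCLE_YEARS):
    """Return (earliest, latest) years left before end of lifecycle.

    Values are floored at zero for units already past a bound.
    """
    low, high = lifecycle
    return (max(low - age, 0), max(high - age, 0))


if __name__ == "__main__":
    for unit, age in AC_UNIT_AGES.items():
        earliest, latest = years_remaining(age)
        print(f"{unit}: {age} years old, {earliest} to {latest} years remaining")
```

On these figures, the unit involved in the outage is within one to six years of the end of its expected lifecycle.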

Action Items

Aspect | Priority | Prime | Due Date | Update

Incident response
Follow up: Reviewed: Fire Department, ERC <-> ITS, ITS Call-in, Business units

Create Incident Response sections in a Datacentre SOP (DC SOP) book and develop a distribution and update process. | M | Judy | Dec 2015 | Paper copy and electronic
Document responsibilities and contacts for Crisis Manager and Crisis Communications role | M | Terry | Dec 2015 | Compile information into DC
Document and communicate PPS Duty Manager contact process | M | Hugh | Dec 2015 | Compile information into DC. Include with FM200 procedure
Document FM200 procedure to clear & test room | M | Hugh | Dec 2015 | Compile information into DC
Document process for engaging a Security resource to guard external doors when open for ventilation | L | Judy | Dec 2015 | Compile information into DC
Enhancements to external monitoring for early detection | | Andy | Complete | NodePing: need to add more contacts
Add services monitoring for services without monitoring | | Andy/Dan B | Sep 2015 | Events Calendar, SCOM server, terminal server RDP port
Review and update (if necessary) contact and procedures documented at Campus Security | M | Ray | Dec 2015 |
Review with PPS the need for ventilation for smoke-producing work and other cutting and dust | M | Hugh | Dec 2015 | July 30: building fire alarm activated from soldering/welding even though B36 alarm disabled. Verify PPS procedure for this work.
Engage Business units to develop their "fallback" plan if systems become unavailable | M | Gail | May 2016 | Initiate with Enterprise Solutions team for Finance, Human Resources and Student Services.
Communicate and update On-Call docs, making carrying the On-call phone as well as pager mandatory | | Judy | Complete |

Communications
Follow-up: Methods: Phone, Twitter, Web, ITSC phone message. Stakeholders: Escalation list, ITS Staff, hosting customers, Queen's community

Hosting customers - list, contact numbers | | Infra Managers | Sep 2015 | Compile information into DC
External client notification (school boards, ORION) - list, contact numbers | | Ray | Sep 2015 | Compile information into DC
Add social media (Twitter) to standard notification channels | | Communications Team | Sep 2015 | Compile information into DC
Define the role of communications team | M | Communications Team | Dec 2015 | Compile information into DC
Add EITAC to customer contact list | | Brad/Terry | Sep 2015 | Compile information into DC
Define ITS Notification procedure for incidents (send email to ITS and EITAC with more detail - Notification on ITS page) | M | Brad | Dec 2015 | Compile information into DC. Verify ITSC phone message still in communication plan.

Facilities PPS/ITS
Follow up: Physical room access, power, AC, cabling, maintenance

Develop budget strategy for overhead cable trays (infrastructure renewal) | | Gail | Sep 2015 | ~$40k tray structure, ~$10k wire re-routing
Develop budget strategy for AC replacement cooling | | Gail | Sep 2015 | ~$100k/unit with 15-20 year lifecycle (infrastructure renewal)
Develop budget strategy for replacement for small UPS | M | Gail | Sep 2015 | ~$100k with 20 year lifecycle (infrastructure renewal)
Budget and schedule yearly DC room cleaning | M | Hugh/Gail | Sep 2015 | ~$10k per datacentre (~$30k)
Define AC maintenance schedule and contract (PPS or external). Develop operational process for monitoring. | M | Hugh | Nov 2015 | Unknown cost
Undertake an emergency power off design review | M | Andy | May 2016 |
Review AC run design as it is at maximum length | M | Hugh | May 2016 |
Investigate options for a DC ventilation plan with PPS | M | Hugh | May 2016 |
Document UPS procedures; post contact number on UPS | | Hugh | Oct 2015 | Compile information into DC
Document physical layout of fire detector on floor plan | L | | May 2016 | Compile information into DC SOP.
Provide staff with FM200 training (emergency shut off) | | Ray | Oct 2015 | Compile information into DC SOP.
Submit PPS work order to inspect datacentre door latch to Chemical Engineer area as it blew open when fire suppression released. | | Hugh | Sep 2015 |
Schedule regular fire sensor replacement with PPS | M | Hugh | Dec 2015 |
PPS to install guard by fire sensors to protect from AC belt dust | | Hugh | Sep 2015 |
Install ammeters on power bars | | Ray | Complete | To be done as they are replaced
Purchase fans to ventilate and/or cool room | | Ray | Complete |
Purchase cart to move equipment to storage | | Ray | Complete |
Contract initial room cleaning as a result of this incident | | Ray | Complete | $10k

Systems
Follow up:

Investigate: AD authentication servers - wireless client couldn't authenticate; AD in PCVL - test with others off | | Systems | Sep 2015 |
Investigate: AD service in Fleming did not allow authentication | | Systems | Sep 2015 |
Review authentication design for O365 email. | | Middleware | Dec 2015 |
Routing of fibre channel network all through Dupuis | M | Networks | May 2016 |
Review "auto boot" feature on servers | M | Systems | Dec 2015 |
Passwords - Keypass at alternate location | | Brad | Sep 2015 |
DNS resolver redundancy needs investigation | M | Brad | Sep 2015 |
Review recovery plan for main Queen's WWW server | M | Brad | Dec 2015 | Compile information into DC
Regular CMDB copy at PCVL | | Brad | Complete | Primary to PCVL, copy nightly to Dupuis
Decommission old systems to assist in restoration process. | M | Brad | Ongoing | Older systems and applications require more manual intervention, and hard shutdowns often result in corrupt data or failed equipment.
Dual power planning procedure for new installations | M | Hugh | Dec 2015 |
Monitoring - mailbox monitoring, bring up SCOM, external monitor ccs-server | | Andy | Complete | NodePing checking ccs-server Sermon status page and www.queensu.ca, texting Andy; Sermon doing qwa login; Sermon checking SCOM IIS active
Update SQL servers | | Brad | Sep 2015 | Apply blade firmware bug update, change all auto power startup, system startup order
Document in DC SOP all systems that require intervention. Include how to check and who to follow up with | M | Brad, Terry | Dec 2015 | Topaz, Oscar, EDRMS; doc should be in CMDB; list of systems needing special startup attention
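Several of the action items above concern external monitoring of services such as www.queensu.ca and the ccs-server status page. As a rough illustration of what such an external check does, the following sketch probes a list of URLs and reports the ones that fail. It does not use the NodePing or Sermon tools named in the action items; the function names and the injectable probe are assumptions made for this sketch, and the status-page URL in the usage note is a hypothetical placeholder because the report does not give one.

```python
import urllib.request
from typing import Callable, Dict, Iterable, List


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout.

    urlopen raises HTTPError for 4xx/5xx responses and URLError for
    connection failures; both are OSError subclasses, as are socket
    timeouts, so all of them count as a failed check here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        return False


def check_services(urls: Iterable[str],
                   probe: Callable[[str], bool] = http_probe) -> Dict[str, bool]:
    """Probe each URL and map it to True (up) or False (down)."""
    return {url: probe(url) for url in urls}


def failed_services(results: Dict[str, bool]) -> List[str]:
    """Return the sorted list of URLs whose check failed."""
    return sorted(url for url, ok in results.items() if not ok)
```

A check of this kind would be run from outside the datacentre, for example `failed_services(check_services(["https://www.queensu.ca", "https://status.example.invalid/ccs-server"]))`, with any failures passed on to whatever notification channel is in use.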