Internet2 Post-Incident Report

Similar documents
Internet2 AL2s 6 Month Report through

Cisco Change Management: Best Practices White Paper

Network Virtualization Platform (NVP) Incident Reports

PLUMgrid Toolbox: Tools to Install, Operate and Monitor Your Virtual Network Infrastructure

Planning and Administering Windows Server 2008 Servers

Managed Security Services SLA Document. Response and Resolution Times

Troubleshooting and Maintaining Cisco IP Networks Volume 1

Community Anchor Institution Service Level Agreement

Network Monitoring and Management Services: Standard Operating Procedures

Empowering the Enterprise Through Unified Communications & Managed Services Solutions

HELP DESK MANAGEMENT PLAN

perfsonar Multi-Domain Monitoring Service Deployment and Support: The LHC-OPN Use Case

APM Support Services Guide

BridgeConnex Statement of Work Managed Network Services (MNS) & Network Monitoring Services (NMS)

Report of Independent Auditors

Managing and Maintaining Windows Server 2008 Servers (6430) Course length: 5 days

Intrado Inc. (as successor in interest to Connexon Telecom Inc) Support Policy

Northland Communications Dedicated Internet, Cloud and MPLS Services Service Level Agreement (SLA) 02/07/2013

Operating and Service Level Objectives

Remote Management Services Unified Communications Addendum

M6430a Planning and Administering Windows Server 2008 Servers

GÉANT Open Service Description. High Performance Interconnectivity to Support Advanced Research

Planning and Administering Windows Server 2008 Servers

GMI CLOUD SERVICES. GMI Business Services To Be Migrated: Deployment, Migration, Security, Management

CNE Network Assessment

Customer Evaluation Report On Incident.MOOG

Planning and Administering Windows Server 2008 Servers

Virtual Machine Manager Domains

SERVICE LEVEL AGREEMENT

July Brennan IT Voice and Data. Service Level Agreement

Statement of Service Enterprise Services - AID Microsoft IIS

N-Wave Networking Services Service Catalog

Managing and Maintaining Windows Server 2008 Servers

Operating Level Agreement By Network Department For Provided Services

Kaspersky Lab Product Support. Enterprise Support Program

NNMi120 Network Node Manager i Software 9.x Essentials

The remedies set forth within this SLA are your sole and exclusive remedies for any failure of the service.

HP Software-as-a-Service

10 Tips to Better Manage Your Service Team

Statement of Service Enterprise Services - MANAGE Microsoft IIS

TXT Power Service Guide. Version

Contact Center Technology Monitoring

What is the Purpose of OA s Enterprise IT Helpdesk Procedure?... 2 How will Problems, Questions or Changes be entered?... 2

Technical Support, Software License and Maintenance Terms for Enterprise 911 Products and Services

Monitoring the Switch

Integration Guide. Help Desk Authority, Perspective and sl

mbits Network Operations Centrec

NYSED DATA DASHBOARD SOLUTIONS RFP ATTACHMENT 6.4 MAINTENANCE AND SUPPORT SERVICES

OMNITURE MONITORING. Ensuring the Security and Availability of Customer Data. June 16, 2008 Version 2.0

Data Center Colocation - SLA

Operations Communications Plan (OCP)

Opengear Technical Note

ProSAFE 8-Port and 16-Port Gigabit Click Switch

Der Weg, wie die Verantwortung getragen werden kann!

INCIDENT MANAGEMENT SCHEDULE

Clarity Assurance allows operators to monitor and manage the availability and quality of their network and services

RATIONALIZING THE MULTI-TOOL ENVIRONMENT

Implementing Microsoft Windows 2000 Clustering

datacenter networking

Outline. Institute of Computer and Communication Network Engineering. Institute of Computer and Communication Network Engineering

Ongoing Help Desk Management Plan

Best Practices For Assigning First Call Responsibilities For Healthcare Networking Issues

How to integrate Verax NMS & APM with Verax Service Desk

Systems Support - Standard

Enterprise UNIX Services - Systems Support - Extended

Lab Configure IOS Firewall IDS

Customer Support Policy

Disaster Recovery Design Ehab Ashary University of Colorado at Colorado Springs

A FAULT MANAGEMENT WHITEPAPER

January Brennan Voice and Data Pty Ltd. Service Level Agreement

DOSarrest External MULTI-SENSOR ARRAY FOR ANALYSIS OF YOUR CDN'S PERFORMANCE IMMEDIATE DETECTION AND REPORTING OF OUTAGES AND / OR ISSUES

Advantech WebAccess Device Driver Guide. BwSNMP Advantech WebAccess to SNMP Agent (Simple Network Management Protocol) Device Driver Guide

PATCHING WINDOWS SERVER 2012 DOMAIN CONTROLLERS. Prepared By: Sainath K.E.V MVP Directory Services

CCNA R&S: Introduction to Networks. Chapter 5: Ethernet

Course Syllabus. Planning and Administering Windows Server 2008 Servers. Key Data. Audience. At Course Completion. Prerequisites. Recommended Courses

Unit Network Infrastructure Maintenance. Standard Service Agreement (SA)

Job Aid: Manual Backup Servers for S8500 or S8700 Media Servers R2.0 or later

Capacity Management Plan

Datasheet: Visual Performance Manager and TruView Advanced MPLS Package with VoIPIntegrity (SKU 01923)

Customized Cloud Solution

Load Balancing Service Agreement. 1.0 Terminology. 3.0 Service Options. 2.0 Service Description. 4.0 Service Delivery

Job Description. HP Advanced Solutions Inc. Position Title: Senior Database Administrator Classification: IS27

CLOUD SERVICES FOR EMS

How To Use A Help Desk With A Pnettrap On A Pc Or Mac Or Mac (For A Laptop)

Analytics Reporting Service

Lumos Parallel Network Operations Centers: Protected Network Monitoring

Disaster Recovery Exercise Report: Exercise conducted November 9, 2013

Cisco Unified Computing Remote Management Services

F5 BIG DDoS Umbrella. Configuration Guide

Configuring PROFINET

Real-World Insights from an SDN Lab. Ron Milford Manager, InCNTRE SDN Lab Indiana University

Custom Application Support Program Guide Version March 02, 2015

Information Services. Standing Service Level Agreement (SLA) Firewall and VPN Services

Evaluating Internal and Outsourced Models for Network Monitoring

Position No. Job Title Supervisor s Position Call Centre Support Supervisor Manager GN Service Desk

Relationship between SMP, ASON, GMPLS and SDN

Shared Hosting Service Agreement. 1.0 Terminology. 3.0 Service Options. 2.0 Service Description. 4.0 Service Delivery

Cisco Remote Management Services Delivers Large-Scale Business Outcomes for Cisco IT

TECHNICAL SUPPORT. and HARDWARE/SOFTWARE/NETWORK MAINTENANCE. for STUDENT-ACCESS LABS

Statement of Service Enterprise Services - AID Servers: Windows/Linux/Unix Network Infrastructure: Switches/Routers/Firewall/Wireless Access

Transcription:

Internet2 Post-Incident Report February 2014 Houston AL2S Packet Loss Incident Date: 12 February 2014 Overview The Internet2 Network experienced a degradation of circuits passing through the AL2S Core Node in HOUH (AL2S sdn-sw.houh) affecting all layers and LEARN. Recommended Actions Better understand AL2S VLANs affected by outages During this incident, getting a list of affected VLANs/customers was difficult. There is a complex manual process today to show VLANs that didn t failover when the backbone interfaces were shut down. We should ensure we have this information at our fingertips to understand the scale of service impact better. Estimated Completion Date: pending Completion Notes: Status: Pending Limit amount of time allowed for vendor diagnostics, improve Service restoration guidelines Multiple steps were taken to mitigate the service impact of the incident along the way. However, too much time was given to Brocade to run diagnostics before making the decision to reboot the switch. We should put additional guidelines for engineers in place so they will know how long a service can be impacted before escalating to more disruptive service restoration steps. It s been proposed that we target 1 hour for full service restoration, but this should be discussed with Internet2 service owners and Interne2 NOC to determine the right guidelines. Estimated Completion Date: 2/20/14 Status: Completed Completion Notes: Guidelines posted on internal NOC pages. 1 hour for restoration as a guiding principal unless overruled by Internet2 senior staff. Improve speed of vendor- provided diagnostic scripts The initial data collection script Brocade has provided the Internet2 NOC took 50 minutes to complete its data collection. If we are to move to a new guideline restore

service in an hour, this should be improved. It s unreasonable to have a diagnostic script that takes this long. Estimated Completion Date: 3/15/14 Status: In Progress Completion Notes: New set of scripts was delivered by Brocade in late February and tested against the Brocade lab equipment. They don t show any material improvement in speed unless they are run local to the Brocade switch. Internet2 systems is working to verify co-existence of the troubleshooting scripts on the local performance assurance nodes to ensure that they don t disrupt the other functions of the local server. Improvements in software/network engineer involvement While both software and network engineers engaged quickly in this particular case, the current SOP doesn t ensure this, because issues are passed to one or the other according to specific criteria. The Internet2 NOC should improve this, so that the first action of a software or network engineer is to engage the other group unless an issue is clearly one or the other. This will lead to faster and more frequent joint troubleshooting. Estimated Completion Date: Pending Completion Notes: Status: Pending Monitor Fabric module outages Before this incident, there were no alarms for syslog events related to switch fabric module outages. This was added already, and completed on 2/6. Estimated Completion Date: 2/6 Status: Completed Completion Notes: Alarming was added to Internet2 NOC systems on 2/6 and will continue moving forward. Non-Recommended Actions All items in the Internet2 NOC Post Mortem are approved

Appendix A: Vendor/Partner Post Mortem Details The following text was authored by Internet2 s vendor or partner responsible for the initial post mortem. It is provided in its entirety except in cases where it would expose sensitive information. Event Summary Post- Incident Report Internet2 AL2S 1316 Date: 2/5/2014 Degradation of circuits passing through Core Node HOUH (AL2S sdn- sw.houh) affecting all layers and LEARN. Postmortem Trigger Internet2 Director of Operations and Engineering requested Related Tickets or Network Owner Feedback I2 IP I2 IP 22312 Degraded Condition Resolved - I2 IP Backbone HOUS- KANS 22318 Outage Resolved - I2 IP Connector LEARN Systems 22317 Request - I2 add a regular expression match for 'Fabric Monitoring' to GNARL I2 TR- CPS I2 TR- CPS 5002 Outage Resolved - I2 TR- CPS Various Customers 5007 Outage Resolved - I2 TR- CPS Customer LEARN I2 Optical 2894 Brief Outage Resolved - I2 Optical Member ESnet Circuit ESNET- HOUS- KANS- 100GE- 07831

I2 Optical 2895 Outage Resolved - I2 Optical Various Circuits (HOUH/HOUS) OneNet Ticket 6236 Stability - OneNet Peers Internet2 AL2S and TR- CPS Parties involved Internet2 NOC Network Engineering Brocade TAC Internet2 NOC Software Engineering Internet2 NOC Service Desk Impact All Traffic traversing AL2S Core Node HOUH was degraded Root Cause The root cause is still under investigation by Brocade at the compiling of this report. Repair & Service Restoration action Core Node HOUH was rebooted, which cleared the immediate event, but was not thought to be a resolution or fix to the events long- term. Prevention Recommendations Methods to prevent the problem can t be determined until root cause is determined. Impact Reduction Recommendations 1.) Better understand AL2S VLANs affected by outages: during this incident, getting a list of affected VLANs/customers was difficult. There is a complex manual process today to show VLANs that didn t failover when the backbone interfaces were shut down. We should ensure we have this information at our fingertips to understand the scale of service impact better.

2.) Limit amount of time allowed for vendor diagnostics, improve Service restoration guidelines: Multiple steps were taken to mitigate the service impact of the incident along the way. However, too much time was given to Brocade to run diagnostics before making the decision to reboot the switch. We should put additional guidelines for engineers in place so they will know how long a service can be impacted before escalating to more disruptive service restoration steps. It s been proposed that we target 1 hour for full service restoration, but this should be discussed with Internet2 service owners and Interne2 NOC to determine the right guidelines. 3.) Improve speed of vendor- provided diagnostic scripts: The initial data collection script Brocade has provided the Internet2 NOC took 50 minutes to complete its data collection. If we are to move to a new guideline restore service in an hour, this should be improved. It s unreasonable to have a diagnostic script that takes this long. 4.) Improvements in software/network engineer involvement: While both software and network engineers engaged quickly in this particular case, the current SOP doesn t ensure this, because issues are passed to one or the other according to specific criteria. The Internet2 NOC should improve this, so that the first action of a software or network engineer is to engage the other group unless an issue is clearly one or the other. This will lead to faster and more frequent joint troubleshooting. 5.) Monitor Fabric module outages: before this incident, there were no alarms for syslog events related to switch fabric module outages. This was added already, and completed on 2/6. Timeline Reporting/Monitoring 2/5/2014 8:39am ET (1339 UTC) Alertmon displays Smokeping alarms related to the AL2S Houston Core node (HOUH). These alarms indicated Discards and Loss. 8:50am ET (1350 UTC) SD Creates AL2S ticket 1316:135 to work this issue. It is opened at 1- Critical. This ticket is assigned to Net Eng. SD tech contacts the Net Eng assigned to the ticket by phone and hands the issue directly to them. Investigation/Verification

9:10am ET (1410 UTC) Network Engineering contacts Software Engineering about possible issues related to AL2S. 9:15am ET (1415 UTC) SD tech advisement to SD I2 Supervisor of the issue per Critical status procedure. 9:33am ET (1433 UTC) Software determines that OESS has received a link migration event from the HOUH switch. Neteng and Software force diff HOUH to restore proper forwarding; first OE- SS re- sync. 9:51am ET (1451 UTC) Brocade was contacted to open a ticket opened to work with Brocade TAC. 9:58am ET (1458 UTC) Net Eng indicates that this is likely to be a hardware issue with Brocade. This is due to the fact that they are reporting that every port is seeing discards except for ports that are in Slot 1 of the Core Node. Internet2 Director of Operations and Engineering advises that this is a Priority 1 case with the vendor, Brocade. 9:59am ET (1459 UTC) Notification to the community is sent out regarding the issue. 10:07am ET (1507 UTC) Systems Engineering continues to work with Network Engineers to resolve the problem. Software notices another link migration on the HOUH switch. 10:16am ET (1516 UTC) Official start time of the Brocade ticket, created by Brocade engineers. 10:45am (1545 UTC) Brocade TAC initiated conference call with Internet2 NOC engineers

10:53am (1553 UTC) Internet2 NOC Network engineers disable SFM4. 11:11am ET (16:11 UTC) Neteng and Software force the HOUH switch to diff restoring proper forwarding. Second re- sync. 11:25am ET (1625 UTC) Software notices another link migration on the HOUH switch. Software asks permission to disable discovery until event has been resolved. Request is approved by Internet2 Director of Operations and Engineering. 11:49am ET (16:49 UTC) Internet2 NOC network and software Engineers force the switch to diff, restoring proper forwarding temporarily. Third re- sync. 11:50am ET (1650 UTC) Interne2 NOC enginer and Brocade believe there may be a switch- fabric issue. 12:13pm ET (1712 UTC) Examination of the logs provides that 7 customer VLANs were affected. 12:22pm ET (1722 UTC) Notification sent to the community with an update on the status of the issue. Service Restoration/Resolution Layer2 Backbone Restoration Time 10:22am ET (1522 UTC) Internet2 IP network traffic is moved to 10G circuits. At this point IP traffic is no longer affected by the incident. Layer2 traffic is still impacted. AL2S Circuits Restoration Time 12:13pm ET (1713 UTC)

With authorization of Internet2 Director of Engineering and Operations, all backbone interfaces are disabled. At this point VLANs with backup paths are no longer impacted. VLANs that were explicitly configured without a backup path are still impacted, as well as VLANs where HOUH was the edge. AL2S Node Restoration Time 2:25pm ET (1925 UTC) AL2S NPT indicated that Core Node HOUH is to be rebooted. This reboot resolve the customer affecting nature of the issue. Once it was clear that the forwarding issues were resolved, the ports that were admin- down were brought back into service to fully resolve the outage. At this point, all services were restored. Problem resolution with Brocade continued. 3:26pm ET (2026 UTC) Systems/Software re- enable discovery in OESS, and determine it is now functioning normally 5:21pm ET (2026 UTC) Brocade TAC conference call ended, without root cause identified. 2/10/2014 8:41pm ET (0141 UTC) Brocade requests and is provided additional Flow log data from the NOC; RFO still unknown. SSH@sdn- sw.houh#show openflow flows e 3/1 Total Number of Flows associated with the port: 60 Flow ID: 7 Priority: 32768 Status: Active Rule: In Port: e3/1 In Vlan: Tagged[102] Action: FORWARD Out Port: e15/2, Tagged, Vlan: 111 Statistics: Total Pkts: 0 Total Bytes: 0 <snipped remaining>

Appendix B: Document Revision History Date Change 2/12/14 Initial Document creation