Generic On-Line Diagnostics

Similar documents
Cisco IOS XR Diagnostics

Cisco Integrators Cisco Partners installing and implementing the Cisco Catalyst 6500 Series Switches

Next-Generation Switch/Router Diagnostics and Debugging

Managing Storage Services Modules

Choosing Tap or SPAN for Data Center Monitoring

Panel 2 Self Management: Separating Facts from Fiction

Configuring the Cisco IOS In-Service Software Upgrade Process

Configuring Flexible NetFlow

Cisco Catalyst 6500 High Availability: Deploying Redundant Supervisors for Maximum Uptime

Troubleshooting the Firewall Services Module

Consolidated Packages and SubPackages Management

Cisco Nexus 7000 Series Supervisor Module

Stateful Switchover (SSO)

CT5760 Controller and Catalyst 3850 Switch Configuration Example

Configuring Right-To-Use Licenses

Network Analysis Modules

Seminar Seite 1 von 10

Cisco Catalyst 6500-E Series Switch as the Backbone of a Unified Access Campus Architecture

Troubleshooting and Maintaining Cisco IP Networks Volume 1

Route Processor. Route Processor Overview CHAPTER

New Features in Cisco IOS Software Release 12.2(33)SXI2

12K Support Training 2001, Cisc 200 o S 1, Cisc 2, Cisc y tems, Inc. tems, In A l rights reserv s rese ed.

Architectural Framework for Self-Managed Networks with Fault Management Hierarchy. Mehmet Toy, Ph.D

Embedded Event Manager Overview

Troubleshooting the Firewall Services Module

Cisco Certified Network Associate Exam. Operation of IP Data Networks. LAN Switching Technologies. IP addressing (IPv4 / IPv6)

Cisco NetFlow Generation Appliance (NGA) 3140

Cisco Application Networking Manager Version 2.0

FWSM introduction Intro 5/1

Read Me First for the HP ProCurve Routing Switch 9304M and Routing Switch 9308M

Administering the Network Analysis Module. Cisco IOS Software. Logging In to the NAM with Cisco IOS Software CHAPTER

Configuring PROFINET

CCT vs. CCENT Skill Set Comparison

CiscoWorks Resource Manager Essentials 4.3

Catalyst 6500/6000 Switches NetFlow Configuration and Troubleshooting

High Availability Failover Optimization Tuning HA Timers PAN-OS 6.0.0

NX-OS and Cisco Nexus Switching

For Sales Kathy Hall

Configuring NetFlow on Cisco ASR 9000 Series Aggregation Services Router

Configuring Redundancy

Configuring the Switch for the Firewall Services Module

Cisco Nexus 1000V Switch for Microsoft Hyper-V

Configuring NetFlow. Information About NetFlow. NetFlow Overview. Send document comments to CHAPTER

Procedure: You can find the problem sheet on Drive D: of the lab PCs. Part 1: Router & Switch

Cisco 7600 Series Route Switch Processor 720

Configuring NetFlow on Cisco IOS XR Software

Chapter 10 Troubleshooting

NetFlow v9 Export Format

FioranoMQ 9. High Availability Guide

SAN Conceptual and Design Basics

Troubleshooting an Enterprise Network

IxNetwork TM MPLS-TP Emulation

Configuring NetFlow-lite

SANbox Manager Release Notes Version Rev A

Chapter 3. Enterprise Campus Network Design

Chapter 28 Denial of Service (DoS) Attack Prevention

Lab Organizing CCENT Objectives by OSI Layer

Configuring SNMP and using the NetFlow MIB to Monitor NetFlow Data

Cisco ubr7200-npe-g2 Network Processing Engine

Understanding Quality of Service on the Catalyst 6500 Switch

Reboot the ExtraHop System and Test Hardware with the Rescue USB Flash Drive

Managing Rack-Mount Servers

CHAPTER 1 COMMANDS FOR DEVICE MANAGEMENT...

Course Contents CCNP (CISco certified network professional)

High Availability and MetroCluster Configuration Guide For 7-Mode

Cisco IOS Software Release 15.0(1)SY1 New Features and Hardware Support

Configuring NetFlow. Information About NetFlow. NetFlow Overview. Send document comments to CHAPTER

hp ProLiant network adapter teaming

Catalyst update & Local Manufactory

Configuring a Load-Balancing Scheme

isco Troubleshooting Input Queue Drops and Output Queue D

High Availability and Redundancy in Catalyst 4500 Series Switches

* Rev A* PN Rev A 1

Lecture 12: Network Management Architecture

UCS Network Utilization Monitoring: Configuration and Best Practice

GSR Hardware Configurations

Cisco Nexus 1000V Virtual Ethernet Module Software Installation Guide, Release 4.0(4)SV1(1)

Cisco Catalyst 2960-S FlexStack: Description, Usage, and Best Practices

Configuring EtherChannel and 802.1Q Trunking Between Catalyst L2 Fixed Configuration Switches and Catalyst Switches Running CatOS

CCNP Switch Questions/Answers Implementing High Availability and Redundancy

CHAPTER 10 LAN REDUNDANCY. Scaling Networks

TROUBLESHOOTING CISCO DATA CENTER UNIFIED FABRIC (DCUFT)

6/8/2011. Document ID: Contents. Introduction. Prerequisites. Requirements. Components Used. Conventions. Introduction

Supervisor Redundancy for the Cisco Catalyst 6500 Series Switches with Cisco Catalyst Operating System

OAM Operations Administration and Maintenance

ROM Monitor. Entering the ROM Monitor APPENDIX

Configuring Health Monitoring

Direct Attached Storage

Configuring NetFlow. Information About NetFlow. Send document comments to CHAPTER

Self-Managed Networks with Fault Management Hierarchy. Mehmet Toy, Ph.D

Cisco Unified Computing Remote Management Services

Maintaining Non-Stop Services with Multi Layer Monitoring

Configuring System Message Logging

7750 SR OS System Management Guide

Cisco Home Agent Service Manager 4.1

CiscoWorks Resource Manager Essentials 4.1

IBRIX Fusion 3.1 Release Notes

IP Application Services Commands show vrrp. This command was introduced. If no group is specified, the status for all groups is displayed.

Managing vlan.dat in Cisco Catalyst Switches Running Cisco IOS Software

Configuring VIP and Virtual IP Interface Redundancy

Transcription:

Generic On-Line Diagnostics 1

What Is Generic On-Line Diagnostics? (GOLD) 2

What Is GOLD? GOLD stands for Generic OnLine Diagnostics GOLD is a platform independent distributed framework that provides a common CLI and scheduling for runtime diagnostics GOLD is part of the Run-Time OS and consists of: Boot Up Diagnostics - during boot up & Online Insertion & Removal (OIR) Health Monitoring Diagnostics - while system is in operation On Demand Diagnostics - using CLI Schedule Diagnostics - using CLI 3

GOLD: How It Works? Is the supervisor control plane and forwarding plane functioning properly? Is the standby supervisor ready to take over? Are linecards forwarding packets properly? Are all ports working? Is the backplane connection working? Forwarding Engine CPU Fabric 4

GOLD: Fault Detection Diagnostics capabilities built in hardware Depending on hardware, GOLD can catch: Port Failure Bent backplane connector Bad fabric connection Malfunctioning Forwarding engines Stuck Control Plane Bad memory 5

What Does GOLD Address? 1/2 Fault Detection framework for High Availability Proactive Diagnostics serve as HA triggers Boot Up Diagnostics Health Monitoring Diag Quick Go/No-Go tests Disruptive & Non-Disruptive tests Periodic background tests Non-Disruptive tests Trouble Shooting Tool for TAC Reactive Diagnostics for trouble-shooting On-Demand & Scheduled Diagnostics Can run all the tests Include Disruptive tests used in manufacturing 6

What Does GOLD Address? (Cont.) 2/2 Consistency across Cisco products Features Behavior CRS-1 Cisco 7600 Catalyst 6500 Catalyst 4500 Catalyst 3750 CLIs Reporting 7

High-Level Software Architecture NMS Layer Fault Policy Manager & other NMS Applications Traffic Re-Route & Remote GUI MIB/SNMP Embedded Syslog Manager Embedded Event Manager Notification and Root Cause Analysis Subsystem Call- Home Notifications & Corrective Actions IOS Layer SEA & OBFL GOLD Subsystems Diagnostics Subsystem Platform Specific Diagnostics Provide Generic Diagnostics & Health Monitoring Framework Detect System Issues (HW, SW, Config errors) Runtime Software Drivers HW Layer Hardware 8

GOLD Test Suite 9

Catalyst 6K GOLD Test Suite (1/2) Boot up Diagnostics Forwarding Engine Learning Tests L2 Tests (Channel, BPDU, Capture, etc.) L3 Tests (IPv4, IPv6, MPLS, etc.) Span and Multicast Tests CAM lookup tests (FIB, NetFlow, QoS CAM, etc.) Port loopback test Health Monitoring Diagnostics SP-RP inband ping test (Sup s SP/RP, EARL(L2 & L3), RW engine) Fabric Channel Health test (Fabric enabled line cards) MacNotification test (DFC line cards) Non Disruptive Loopback test (new line cards) Scratch register test (PLD & ASICs) 10

Catalyst 6K GOLD Test Suite (2/2) On-Demand Diagnostic Exhaustive memory test Exhaustive TCAM test Traffic Stress Testing All boot up and health monitoring tests can be run on-demand using CLI On-Demand Exhaustive tests can be used during pre-production staging. Schedule Diagnostic All boot up and health monitoring tests can be scheduled Schedule Switch-over 11

POST vs. GOLD Boot Up Diagnostics POST : Power On Self Test Test CPU sub-system, system memory and peripheral during early stage of OS bring up cisco WS-C6503 (R7000) processor (revision 1.1) with 491520K/32768K bytes of memory. Processor board ID FOX073904WS R7000 CPU at 300Mhz, Implementation 39, Rev 3.3, 256KB L2, 1024KB L3 Cache GOLD Boot Up Diagnostics Perform functional packet switching test and ASIC memory test using runtime driver just before the module is declared online. *Mar 6 21:52:17.751: %DIAG-SP-6-RUN_COMPLETE: Module 1: Running Complete Diagnostics... *Mar 6 21:52:30.940: %DIAG-SP-6-DIAG_OK: Module 1: Passed Online Diagnostics *Mar 6 21:52:36.805: %DIAG-SP-6-RUN_COMPLETE: Module 3: Running Complete Diagnostics... *Mar 6 21:52:45.300: %DIAG-SP-6-DIAG_OK: Module 3: Passed Online Diagnostics 12

Cat6K Online Diagnostic Methodology Boot-up diagnostics touch every single ASIC/memory device in the data path and control path. Perform Functional Testing combined with components monitoring to detect fault in passive components (connector, solder joint etc.) and active components (ASICs, PLDs etc.). Tests are written using run-time driver routines to catch SW defects. Non-disruptive tests are used as HA triggers. Both disruptive and non-disruptive tests are available on-demand as trouble shooting tools for CA/TAC. Root cause analysis and corrective actions are performed upon test failure. EEM will be used for configurable corrective action. (Tcl based) 13

Using Boot Up Tests 14

Boot Up Tests Feature Definition and Benefits Boot Up tests provide the ability to do quick Go/No-Go decisions on hardware Failing hardware is prevented from going into service Boot up diagnostics takes less than 10 seconds per module in complete mode and in the minimal mode it is about 5-7 seconds per module Boot up diagnostic level is stored as part of the switch configuration 15

Catalyst 6000 Boot Up Diagnostic Timing Diagram Active Sup Standby Sup 1 Diag run locally on each Sup (Forwarding Engine tests, Fabric Snake Test, Uplink ports test etc.) 1 Diag run locally on each Sup (Forwarding Engine tests, Fabric Snake Test, Uplink ports test etc.) 2 Standby Sup runs diag on Active Sup s uplink ports 3 Active Supervisor is Online 4 Active Sup runs diag on Standby Sup s uplink ports Active Supervisor will start bringing up the line cards one by one and run boot up diagnostics before declaring the line card online 5 Standby Supervisor is Online 6 7 8 Line card 1 is Online 9 Line card 2 is Online Line card 3 is Online Line card 4 is Online 16

Using Health Monitoring Tests 17

Health Monitoring Tests Feature Definition and Benefits Non-disruptive tests running in the background while the system is switching packets Ability to disable/enable Health Monitoring tests Any non-disruptive diagnostic test could be configured to be run as HM test Ability to change monitoring interval of each test (365 days down to 50 milliseconds granularity) Health-Monitoring info is stored in the switch configuration Health-Monitoring is SSO-compliant. Upon switchover, healthmonitoring tests will run from new active seamlessly 18

Health Monitoring Tests Test Name: TestSPRPInbandPing Test Description: Detects most runtime software driver and hardware problems on supervisor engines Test Coverage: Tests the Layer 2, Layer 3 and 4 forwarding engine, and the replication engine on the path from the switch processor to the route processor. Test Frequency: Packets are sent at 15-second intervals. Ten consecutive failures results in failover or reload Attributes: Global, Non-disruptive, Default-On Corrective Action: Reset the active supervisor engine 19

SP-RP Ping Test (Supervisor Health Monitoring Test) Supervisor Card Continuous validation of Forwarding Path L2 ASIC & L2 table memories SP CPU Diag Packet Diag Packet RP CPU Diag Packet Diag Packet Rewrite & Multicast Engine Diag Packet L3 ASIC & FIB table (TCAMS) 20

Understanding Best Practices and Limitations for Health Monitoring There might be transient test failures due to bus stalls, or switching mode change. It will recover later. No action is taken until the failure threshold is crossed At high traffic or high CPU util, some HM tests are skipped to avoid false failures Supervisor module crash due to HM test failure is a symptom detected by proactive monitoring and not the root cause 21

Using On Demand Tests 22

On-Demand Diagnostics Feature Ability to run tests from CLI Can be used by TAC to debug customer problems (field diagnostic), by engineers for internal debug User can run the test multiple times by configuring the iterations setting Allows user to configure what action to take upon failure either continue or stop Yes/No Confirmation before running any disruptive test 23

On-Demand Online Diagnostics CLIs diagnostic start module <module#> test {<test-id> <test-id-string> <test_name> minimal per-port non-disruptive} [port {<port#> <port#-string> all}] To run the specified Diagnostic test(s) at the specified slot. For per-port tests, the user can also specify the port number range. Dynamic Help is also provided to the user. Use non-disruptive flag to run all non-disruptive tests Use minimal flag to run all tests that run in minimal diags mode diagnostic stop module <module#> To stop the diagnostic running at the specified slot 24

On-Demand Online Diagnostics CLIs (Cont ) diagnostic ondemand iterations <iteration-count> To set the ondemand testing iteration count diagnostic ondemand action-on-failure {continue stop} [<failure-count>] To set the execution action when error is detected. The user can choose to continue or to stop when the test failure is detected. To stop the test after certain number of failures, use failure-count flag show diagnostic ondemand settings To display the settings for ondemand diagnostic 25

Case Study Situation: Action: Customer was running into a problem : packets ingress on a particular line card were getting dropped intermittently. All software/hardware entries etc were checked. TAC engineer requested customer to run line card memory test Results: Diagnostics results revealed that memory was failing. Line card was replaced and the switch functionality was restored in a very short time Thanks to GOLD!! 6500_cust# diagnostic start module 1 test 28 Module 1: Running test(s) 28 may disrupt normal operation Do you want to run disruptive tests? [no] yes Mar 17 15:58:34: SP: ****************************************************************** Mar 17 15:58:34: SP: * WARNING: Mar 17 15:58:34: SP: * ASIC Memory test on module 1 may take up to 1hr 30min. Mar 17 15:58:34: SP: * During this time, please DO NOT perform any packet switching. Mar 17 15:58:34: SP: ****************************************************************** Mar 17 16:10:27: SP: diag_scp_asic_mem_test [1/1/RN_PBIF]: LCP TEST FAILED. fail_addr = 0xE923, test_data result_data: 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 53, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, 55 55, Mar 17 16:10:27: SP: do_mem_test [1/1]: test RN_PBIF memory failed Mar 17 16:10:27: SP: ****************************************************************** Mar 17 16:10:27: SP: * WARNING: Please RESET module 1 prior to normal use. Also, packet Mar 17 16:10:27: SP: * switching tests will no longer work (i.e. test failure) because Mar 17 16:10:27: SP: * its memories are filled with test patterns. Mar 17 16:10:27: SP: ****************************************************************** MarchCMem: got data mismatch at addr: 0xE923, dev#: 1 rc = 0x12 comparison data rslt: 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 53 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 Mar 17 16:10:27: %DIAG-SP-3-TEST_FAIL: Module 1: TestLinecardMemory{ID=28} has failed. Error code = 0x1 26

Using Scheduled Tests 27

Schedule Diagnostics Feature Ability to schedule tests to run at certain time or daily/weekly (i.e. one-time or periodically) Can create unlimited number of schedules 28

Schedule Online Diagnostics CLIs [no] diagnostic schedule module <module#> test {<test-id> <test-idrange> all} [port {<port#> <port#-range> all}] on <month> <date> <year> <hh:mm> To schedule diags to run on a specific day at a particular time (runs only once) [no] diagnostic schedule module <module#> test {<test-id> <test-idrange> all} [port {<port#> <port#-range> all}] daily <hh:mm> To schedule diags to run daily at a particular time [no] diagnostic schedule module <module#> test {<test-id> <test-idrange> all} [port {<port#> <port#-range> all}] weekly <day-of-week> <hh:mm> To schedule diags to run weekly on a particular day at a particular time 29

Schedule Switchover Schedule Switchover allows the customer to exercise standby Supervisor at pre-defined time Best Practice Issue two schedule switchover config. The first command to switchover from active to standby The second command is to schedule a switch over back from the new active supervisor to new standby supervisor about 5 minutes later 30

Other Information 31

What Happens When A Test Fails? Depending on the type of test failure, GOLD triggers Supervisor switch-over Fabric switch-over Port shut down Line card reset Line card power down Generates a syslog message Generates a call-home message Informs Embedded Event Manager (EEM) to invoke other actions configurable via EEM Tcl script 32

Typical Syslog Messages GOLD related Syslog messages will start with the string DIAG or CONST_DIAG Sample text of a syslog message: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 2: TestL3VlanMet failed %CONST_DIAG-SP-2-HM_MOD_RESET: Resetting Module 3 for software recovery, Reason: Failed TestMacNotification DIAG-SP-3-TEST_FAIL: Module 5: TestTrafficStress{ID=24} has failed. Error code = 0x1 More detailed Syslog information: http://www.cisco.com/univercd/cc/td/doc/product/lan/cat6000/122sx/msgguide/emsg.ht m#wp1293376 33

Items To Avoid/Remember Do not run any packet switching tests following exhaustive memory tests. The loop back tests will fail. Do not assume diagnostics failure is a confirmation that hardware is defective especially for boot up tests. Run exhaustive tests to confirm since defect could be due to software failure in some cases. Non optimal network configurations that result in oversubscribing device capabilities will cause diagnostics to fail. 34

Other Tips To Remember & Further Reading When Interacting With Cisco TAC Most common request from TAC engineers when working on a Cat 6K service request show diagnostics event all show diagnostics results module all detail Further Reading: Cisco.com documentation link http://www.cisco.com/univercd/cc/td/doc/product/lan/cat60 00/122sx/swcg/diagtest.htm 35

Conclusion GOLD provides the generic diagnostics and health monitoring framework to detect hardware and some software issues GOLD can be used proactively to provide High Availability triggers in the event of a hardware failure GOLD can be used as part of the troubleshooting process to pin point a specific functional area of the hardware that is failing 36

37