Next-Generation Switch/Router Diagnostics and Debugging




Introduction

The goal for a next-generation diagnostics and debugging system for switch/routers is to improve system uptime and availability by preventing the occurrence of certain types of hard errors (increasing system MTBF) and by improving fault isolation and resolution time (reducing MTTR). These improvements can be realized by augmenting traditional reactive diagnostics with proactive, autonomous components that run in the background or are triggered by events rather than being initiated by operator command. Autonomous functionality can include error checking that proactively verifies correct operation of various hardware and software subsystems, plus the capability to take automated actions when an error or exception is detected. The automated actions may take a number of forms, including:

- Notification of network operations personnel of the detected error or exception
- Initiation of further system checks and information logging to help isolate the condition to a specific FRU
- Notification of network operations personnel of which FRU is causing the problem
- Saving all relevant information in a core dump file or other crash log file in the event of an unplanned system or subsystem reset

A more proactive approach to verifying the system's operational correctness has a number of benefits that can enhance system and network reliability:

- Early detection of conditions that could eventually lead to a hard error. If corrective action can be taken before a hard failure occurs, system MTBF measured in the field will improve significantly
- More accurate reporting of errors/exceptions, with fewer false positives and better fault isolation. Pre-determination of the root-cause FRU reduces the time wasted on trial-and-error swap-outs to identify faulty components and results in a significant reduction in MTTR
- Better detection and isolation of soft errors, which maximizes the availability of the network and improves measured user response times for network applications

Force10 Networks is implementing a next-generation diagnostics and debugging approach across all product lines that includes:

- A comprehensive suite of reactive diagnostics accessible via the CLI, with appropriate show and debug commands to determine root cause and to isolate the fault to a specific FRU
- A growing suite of proactive system health checks that run autonomously whenever the system is in operation. These diagnostics can detect and report errors via a syslog message and can also be configured to take action in real time to minimize the impact of an error

Reactive vs. Reactive/Proactive Diagnostics

The classical monitor, detect, isolate, resolve model for system diagnostics and debugging is shown in Figure 1. With a primarily reactive approach to diagnosis and debugging, most of the isolation and resolution phases of the process are quite labor intensive, while there is often a reasonable degree of automation in the monitoring and detection phases.

Figure 1. Monitor, detect, isolate and resolve model

With a balanced reactive/proactive approach to diagnosis/debugging, the basic model remains the same. However, the proactive/automated components of the system can greatly improve operator effectiveness while reducing labor intensity, especially in the isolation and resolution phases. Figure 2 shows conceptually the impact that a reactive/proactive system can have on the manual labor component of each phase of diagnosis/debugging.
In an ideal system, monitoring, detection and isolation would be almost entirely automated, and only the resolution phase would still have a manual time component to account for the physical replacement of FRUs.
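The autonomous check-and-act behavior described above can be pictured as a simple event loop: run a proactive check and, if it fails, fan out to the automated actions (notify operations, collect more data, isolate to an FRU). The Python sketch below is purely illustrative; the class, function and message names are hypothetical and are not FTOS interfaces.

# Illustrative only: a minimal proactive health-check loop with automated
# actions, loosely modeling the behavior described in the text. All names
# here are hypothetical, not FTOS interfaces.
import time

class HealthCheck:
    def __init__(self, name, probe, actions):
        self.name = name          # e.g. "rpm-sfm-dataplane-loopback"
        self.probe = probe        # callable returning True (pass) / False (fail)
        self.actions = actions    # automated actions to run on failure

def notify_operations(check, detail):
    # Syslog-style notification; the facility/severity tag is invented.
    print(f"%DIAG-2-{check.name.upper().replace('-', '_')}_FAIL: {detail}")

def collect_and_isolate(check, detail):
    # Placeholder for deeper checks that narrow the fault to a specific FRU.
    print(f"collecting additional state for {check.name}: {detail}")

def run_forever(checks, interval_s=10):
    """Run each proactive check periodically; on failure, trigger its actions."""
    while True:
        for check in checks:
            if not check.probe():
                for action in check.actions:
                    action(check, "proactive check failed")
        time.sleep(interval_s)

A real system would schedule checks independently and rate-limit repeated notifications; the point is only that detection and the first isolation steps can be wired together without operator involvement.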

Figure 2. Impact of proactive and automated diagnosis/debugging

Proactive Diagnostics and Debugging Enhancements

This section of the document provides an overview of some of the proactive diagnostic and debugging features that are available in FTOS for the E-Series switch/routers.

Runtime System Health Checks

System health checks (or health monitors) run proactively while the system is operational and are intended to detect errors in data transfers within the system. System health checks are performed using test frames that are interspersed among frames carrying user data and system messages. As a result, system health checks do not disrupt the normal flow of traffic within the device.

One example of a system health check for the E-Series switch/router is the data plane loopback test. As shown in Figure 3, the test has two parts: a loopback from the line card through the switch fabric, and a loopback from the route processor module (RPM) through the switch fabric module (SFM). Each RPM and each line card CPU periodically sends out test frames that loop back through the SFMs. The loopback runtime test results reflect the overall health status of the data plane and can be used to identify a single faulty SFM.

If three consecutive RPM loopback test frames are dropped and all SFMs are enabled, the system attempts to isolate and then disable a faulty SFM by automatically "walking" the SFMs. In this process, each SFM is placed sequentially in an offline state and another set of RPM loopbacks is performed until the faulty SFM or SFMs are identified. If the fault is isolated to a single SFM, it is disabled and the event is logged. The system does not perform an SFM walk if one of the installed SFMs is already disabled.

The set of system messages corresponding to an RPM loopback failure followed by a successful SFM walk is shown below.

%TSM-2-RPM_LOOPBACK_FAIL: RPM-SFM dataplane loopback test failed
%TSM-2-SFM_WALK_START: Automatic SFM walk-through started
%TSM-6-RPM_LOOPBACK_PASS: RPM-SFM dataplane loopback test succeeded
%TSM-2-BAD_SFM_DISABLED: Bad SFM in slot 0 detected and disabled
%TSM-2-SFM_WALK_SUCCEED: Automatic SFM walk-through succeeded

When an SFM is disabled, an alarm is sent. (Note that the following message applies to the E1200, which has nine SFMs enabled during normal operation):

%CHMGR-2-MINOR_SFM: Minor alarm: only eight working SFM

The data plane loopback test offers configuration options that allow the operator to select the action the system will take in the event of a failed test. This allows system behavior to be made consistent with uptime/availability targets and hardware sparing-level policies.

Figure 3. E-Series data plane loopback test
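The SFM walk procedure described above lends itself to a compact description in code: take each fabric module offline in turn, re-run the RPM loopback, and stop when the loopback passes. The following Python sketch is illustrative only; the helper callables (set_sfm_offline, run_rpm_loopback, and so on) are hypothetical stand-ins, not FTOS interfaces.

# Illustrative sketch of the SFM "walk" isolation procedure described above.
# The helper callables are hypothetical placeholders, not FTOS APIs.

def sfm_walk(sfm_slots, set_sfm_offline, set_sfm_online, run_rpm_loopback):
    """Return the slot of the SFM whose removal makes the loopback pass, or None.

    sfm_slots        -- iterable of SFM slot numbers currently enabled
    set_sfm_offline  -- callable(slot) that places one SFM offline
    set_sfm_online   -- callable(slot) that restores one SFM to service
    run_rpm_loopback -- callable() -> True if the RPM-SFM loopback passes
    """
    for slot in sfm_slots:
        set_sfm_offline(slot)          # remove one SFM from the data path
        if run_rpm_loopback():
            # The loopback succeeds only with this SFM out of the path,
            # so this SFM is the likely fault; leave it disabled.
            return slot
        set_sfm_online(slot)           # not the culprit; restore and continue
    return None                        # fault not isolated to a single SFM

In FTOS the walk is triggered only after three consecutive RPM loopback failures and only when no SFM is already disabled, so spare fabric capacity is never reduced twice.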

If desired, the automatic SFM walk that is launched by default after an RPM-SFM runtime loopback test failure can be disabled. Also, the automatic shutdown of a single faulty SFM identified by an SFM walk can be disabled. If a line card (LC-SFM) runtime loopback test fails, the system does not launch an SFM walk, but simply logs an error message indicating the failure:

%TSM-2-RPM_LOOPBACK_FAIL: Linecard-SFM dataplane loopback test failed on linecard 6

SFM Channel Monitoring

In addition to proactively monitoring the data plane for dropped frames, there are several additional proactive data integrity checks for traffic going across the backplane. One of these is SFM channel monitoring using the Per-Channel De-skew FIFO Overflow (PCDFO) polling feature. As with the data plane loopback feature, the PCDFO polling feature is enabled by default.

Figure 4 illustrates the E600 and E1200 switch fabric architecture. Each ingress and egress buffer and traffic management (BTM) ASIC maintains nine channel connections to the TeraScale Switch Fabric (TSF) ASICs on the SFMs. The PCDFO poll is designed to detect a bad channel on an SFM, RPM or line card. It performs this function by breaking a test frame into segments and striping it across all of the SFM channels between the eBTM and the iBTM. The egress BTM ASIC must receive each segment of striped data within a specified time interval (t) in order to consider the segments to have the proper temporal alignment. Small timing skews, less than t, can be tolerated because the BTMs maintain a small FIFO memory pool that allows the segments to be realigned before forwarding. A PCDFO poll fails when the skew among the segments exceeds t. When this happens, the realignment FIFOs overflow and some of the segments are dropped.

There are two classes of errors that can lead to PCDFO poll failures. Errors in the first class are transient in nature. Transient errors are considered random events that may occur only once or may recur only for short periods of time. Because such transient events tend to occur very infrequently, they do not have a measurable effect on switch/router performance or functionality. Since these events are not repeatable, no action is required by network operations staff. The second class of errors that can cause PCDFO failures is systematic in nature. Systematic errors are repeatable events because they are generally caused by persistent malfunctions in hardware devices or components.

Detection of a PCDFO event causes the system to generate a message similar to the following:

%RPM1-P:CP %CHMGR-2-SFM_PCDFO: PCDFO error detected for SFM #

Figure 4. SFM channel monitoring on an E1200 or E600 line card
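The PCDFO check reduces to a simple rule: every segment of the striped test frame must arrive within the skew budget t, and a failing poll is only actionable if it keeps recurring. The Python sketch below illustrates that classification logic; the names, thresholds and the printed message format are assumptions, not FTOS internals.

# Hypothetical illustration of PCDFO-style skew checking and transient vs.
# systematic classification. Names and thresholds are assumptions only.
from collections import defaultdict

SKEW_BUDGET_T = 1.0e-6       # assumed per-poll skew budget in seconds
SYSTEMATIC_THRESHOLD = 3     # assumed: repeated failures => systematic error

def pcdfo_poll_failed(segment_arrival_times):
    """A poll fails when the spread of segment arrival times exceeds t."""
    skew = max(segment_arrival_times) - min(segment_arrival_times)
    return skew > SKEW_BUDGET_T

class PcdfoMonitor:
    def __init__(self):
        self.failures = defaultdict(int)   # per-SFM consecutive failure count

    def record_poll(self, sfm_slot, segment_arrival_times):
        if pcdfo_poll_failed(segment_arrival_times):
            self.failures[sfm_slot] += 1
            print(f"PCDFO error detected for SFM {sfm_slot}")  # log the event
            if self.failures[sfm_slot] >= SYSTEMATIC_THRESHOLD:
                # Persistent pattern: flag for reactive diagnostics
                # (e.g. a manual data plane loopback test).
                return "systematic"
            return "transient"
        self.failures[sfm_slot] = 0        # recovered: prior failures were transient
        return "ok"

The actual feature simply logs when the error pattern changes; the threshold above only illustrates why a recurring pattern, rather than a single poll failure, is what warrants operator action.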

Events are logged when a PCDFO error first occurs on any SFM and when the PCDFO error pattern changes. For transient errors, PCDFO error messages can be expected to disappear after a very short time without external intervention. The hardware automatically recovers from the transient error state, and the data plane continues to function properly. For persistent errors, additional reactive diagnostics will need to be run to isolate and resolve the root cause. For example, to confirm that an identified SFM needs to be replaced, a manual data plane loopback test could be executed. If an error persists and cannot be diagnosed, it may be necessary to contact Force10 technical support.

Parity Error Scanning

E-Series RPMs and line cards support multiple parity checking points to ensure data integrity throughout the internal forwarding, lookup and buffering path. Parity errors can be either transient or persistent. The traditional, manual approach to differentiating between transient and persistent parity errors is to reset the card and monitor for a second occurrence of the error. This approach has the disadvantage of possibly encountering additional data errors until the diagnosis of a persistent error is confirmed.

FTOS now uses an automated memory scanning diagnostic to proactively determine whether a parity error is a transient error or a persistent (hard) error. Figure 5 shows a comparison of the traditional, reactive process with the automated, proactive process of diagnosing parity errors. This example is based on parity scanning for the FPC SRAM, which stores the next-hop lookup table. In the proactive process, all error messages are saved in the syslog regardless of whether the error was diagnosed as transient or persistent. In the instance illustrated in Figure 5, the proactive diagnostic has maximized uptime by indicating clearly when no reset or other action is required. In the case of a persistent error, the diagnostic would identify the exact location of the parity error and automatically rewrite that location in an attempt to fix the entry. Automatic correction brings final resolution to a customer-reported event. Only extremely rare cases would call for the reset or replacement of the line card in question. In addition, any non-recoverable parity error events are reported via an SNMP trap in the FORCE10-CHASSIS-MIB.

Automatic Information Collection Triggered by Software Exceptions

The E-Series automatically collects critical fault information when an RPM or line card resets or experiences a failure. This proactively gathered information is saved to flash memory in one of several files, which can be reviewed by Force10 Networks technical support personnel to help isolate the cause of the error. The system preserves fault information in the following files:

- Core Dump: The most extensive crash log file, consisting of a dump of the entire memory space on the line card or RPM
- Failure Trace Log: Preserves the contents of the buffered trace log, which contains messages about internal FTOS software task events
- Sysinfo: Captures the status of counters on the Ethernet interfaces that connect line card and RPM CPUs to the inter-CPU party bus. Information relevant to FTOS control plane communications among the E-Series CPUs is captured in this file
- NVTrace: Captures and preserves critical line card status information prior to the occurrence of the last software exception

Figure 5. Comparison of reactive and proactive parity troubleshooting processes
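The proactive parity workflow described above (detect the bad entry, rewrite it, then re-read to decide whether the error was transient or persistent) can be sketched in a few lines of Python. This is a conceptual illustration only; the memory accessors and the parity scheme are assumptions, not the FPC SRAM interface.

# Conceptual sketch of proactive parity scanning: detect a bad entry,
# rewrite it from a known-good source, and re-check to classify the error.
# The read/write helpers are hypothetical, not FTOS or FPC interfaces.

def even_parity(word: int) -> int:
    """Return the even-parity bit for a data word (1 if the word has an odd
    number of set bits)."""
    return bin(word).count("1") & 1

def scan_and_correct(read_entry, write_entry, known_good, num_entries):
    """Scan a table, rewrite entries whose parity check fails, and classify.

    read_entry(i)  -> (data_word, stored_parity_bit)   # hypothetical accessor
    write_entry(i, data_word)                          # hypothetical accessor
    known_good(i)  -> correct data word (e.g. from a master copy elsewhere)
    """
    results = {}
    for i in range(num_entries):
        data, parity = read_entry(i)
        if even_parity(data) == parity:
            continue                      # entry is healthy
        write_entry(i, known_good(i))     # attempt automatic correction
        data2, parity2 = read_entry(i)
        if even_parity(data2) == parity2:
            results[i] = "transient"      # rewrite fixed it: soft error
        else:
            results[i] = "persistent"     # still failing: hard (stuck) error
    return results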

Table 1 provides a summary of the types of files that are captured by RPMs and line cards.

             Core Dump   Trace Log   Sysinfo   NVTrace
  RPM        yes         yes         yes       no
  Line Card  yes         yes         yes       yes

Table 1. Automatically preserved diagnostic information files

In addition, command history is a proactive/reactive feature that can be used to see whether a particular sequence of system commands may have led to a software exception. The command history file proactively stores a time-stamped log entry that uniquely identifies each command as it is executed. The contents of the file can be used reactively by Force10 Networks technical support and engineering personnel to help isolate the cause of the exception.

Force10#show command-history 15
[9/1 15:9:28]: CMD-(CLI):[interface gigabitethernet 12/0] by default from console
[9/1 15:11:51]: CMD-(CLI):[show startup-config] by default from console
[9/1 15:24:24]: CMD-(TEL46):[enable] by admin from vty0 (peer RPM)
[9/1 15:25:23]: CMD-(TEL46):[show interfaces managementethernet 1] by admin from vty0 (peer RPM)

Reactive Diagnostics and Debugging Enhancements

Even as more emphasis is placed on proactive diagnostics, reactive diagnostic features continue to play an essential role in isolating errors and maximizing system uptime. Reactive diagnostics that are designed specifically to complement proactive diagnostic features are expected to play a significant role as the switch/router serviceability model continues to evolve toward the ideal of fully automated fault monitoring, detection and isolation, coupled with partially automated resolution. The remainder of this section of the document focuses on enhancements to the FTOS reactive diagnostics and debugging features for the E-Series.

Offline Diagnostics

The offline diagnostics test suite available in the FTOS system image for the E-Series can be useful for reactive fault isolation and debugging of offline line cards. Offline diagnostics are useful in isolating a fault symptom, such as interface errors or unexplained packet loss, to a line card. The offline diagnostics tests are grouped into three levels and generally cover the following functions:

- Verify the existence of an ASIC or other device
- Test the device's internal parts (e.g., registers)
- Perform data-path loopback tests

Offline diagnostics are invoked from the FTOS CLI. While diagnostics are running, their status can be monitored via the CLI. The test results are written to a file in flash memory and can be displayed on screen.

Force10#show file flash:/testreport-LC-4.txt
Starting level0 diags
**** DIAG LEVEL 0 ****
+ Pci Bridge1 level0 test...pass
+ Pci Bridge2 level0 test...pass
+ MANAGEMENT_FEC level0 test...pass
+ C2PORT_FEC level0 test...pass
+ PARTY_BUS_FEC level0 test...pass

Detailed statistics for all tests are collected. These statistics include last execution time, first and last test pass time, first and last test failure time, total run count, total failure count, consecutive failure count, and error code.

Runtime Hardware Monitoring

FTOS continuously monitors the status of hardware by polling hundreds of registers on key ASICs and system components on line cards, RPMs and SFMs, and writing this information to a file. The log is initialized at system startup and continues to dynamically log events and errors as they occur.
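To illustrate how the offline diagnostics report shown earlier in this section lends itself to automated post-processing, here is a short Python sketch that tallies pass/fail results from report lines in that format. The parsing pattern is an assumption based on the sample output, not a documented report format.

# Sketch: tally pass/fail results from an offline diagnostics report whose
# lines look like "+ Pci Bridge1 level0 test...pass". The line format is
# assumed from the sample above, not from a documented specification.
import re
from collections import Counter

LINE_RE = re.compile(r"^\+\s+(?P<test>.+?)\s+test\.{3}(?P<result>pass|fail)\s*$")

def summarize_report(report_text: str):
    results = Counter()
    failed = []
    for line in report_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue                       # skip headers such as "**** DIAG LEVEL 0 ****"
        results[m.group("result")] += 1
        if m.group("result") == "fail":
            failed.append(m.group("test"))
    return results, failed

sample = """Starting level0 diags
**** DIAG LEVEL 0 ****
+ Pci Bridge1 level0 test...pass
+ Pci Bridge2 level0 test...pass
+ MANAGEMENT_FEC level0 test...pass
"""
counts, failed_tests = summarize_report(sample)
print(dict(counts), failed_tests)          # {'pass': 3} []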
The show hardware commands comprise a key reactive component of the overall diagnostic/debugging system that allows a detailed look at various counters and statistics gathered at multiple points within the system data and control planes. The show hardware commands provide the operator with greatly enhanced visibility into the finer details of the E-Series hardware architecture. For example, an experienced operator can perform detailed fault diagnosis by correlating the results of a show hardware command with the proactively gathered information contained in the syslog file. Show hardware commands are also an important tool used by Force10 Networks technical support engineers to assist customers in fault diagnosis.
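The correlation step mentioned above, lining up counter snapshots from show hardware output with syslog events, is easy to picture in code. The sketch below is hypothetical: it assumes timestamped counter snapshots and syslog entries have already been captured by some external means, and nothing in it talks to a device.

# Hypothetical sketch: correlate timestamped counter snapshots (e.g. drop
# counters captured from "show hardware" output) with syslog events that
# occurred shortly before each counter increase.
from datetime import timedelta

def correlate(snapshots, syslog_events, window=timedelta(minutes=5)):
    """snapshots:     list of (timestamp, counter_value), oldest first
       syslog_events: list of (timestamp, message)
       Returns (snapshot_time, delta, nearby_messages) for each counter increase."""
    findings = []
    for (t_prev, v_prev), (t_cur, v_cur) in zip(snapshots, snapshots[1:]):
        delta = v_cur - v_prev
        if delta <= 0:
            continue                                   # counter did not increase
        nearby = [msg for (ts, msg) in syslog_events
                  if t_cur - window <= ts <= t_cur]    # events just before the jump
        findings.append((t_cur, delta, nearby))
    return findings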

The show hardware command tree includes a number of privileged EXEC commands that have been created or changed specifically for use with the E-Series. Figure 6 shows some examples of the show hardware commands for monitoring the data plane and control plane (party bus) of the RPM. Show hardware commands that apply to the E-Series line cards provide a similar degree of hardware visibility by exposing the contents of status registers and drop counters. E-Series show hardware commands that apply to line cards include the following:

- Buffer and Traffic Management Commands: BTM commands are used to access information on the buffer and traffic management (BTM) ASIC on a line card. Command options are available to view or clear various counters, status registers and queue information
- Flexible Packet Classification Commands: The flexible packet classification (FPC) ASIC provides line-rate traffic classification for QoS and ACLs. Command options are available for displaying advanced debugging information about the FPC. For the forwarding functional area of the FPC, it is possible to display the contents of receive and transmit counters, error counters, and status registers. For the lookup functional area of the FPC, it is possible to display advanced debugging information

The hardware monitor feature provides a configurable option for automatic action by the system when certain types of hardware events occur.

Force10(conf)#hardware monitor mac action-on-error port-shutdown
Force10(conf)#hardware monitor linecard asic btm action-on-error ?
card-problem  card-reset  card-shutdown

The card-problem action-on-error provides an option to leave a card in a problem state (show linecard will display the status as "online card problem") so that Force10 engineering and TAC can access the system and collect diagnostic data for further analysis.

Summary/Conclusion

Force10 Networks has made significant progress toward its vision of a next-generation diagnostics and debugging approach for the E-Series switch/router family based on the evolution of FTOS. In future versions of FTOS, Force10 will continue to enhance both the reactive and the proactive components of the system to allow network operators to maximize the availability of their networks and gain the maximum reliability advantages from the hardware and software resiliency features.

Figure 6. Examples of show hardware commands for the RPM

Force10 Networks, Inc.
350 Holger Way
San Jose, CA 95134 USA
www.force10networks.com
408-571-3500 PHONE
408-571-3550 FACSIMILE

2008 Force10 Networks, Inc. All rights reserved. Force10 Networks and E-Series are registered trademarks, and Force10, the Force10 logo, Reliable Business Networking, Force10 Reliable Networking, C-Series, P-Series, S-Series, EtherScale, TeraScale, FTOS, SFTOS, StarSupport and Hot Lock are trademarks of Force10 Networks, Inc. All other company names are trademarks of their respective holders. Information in this document is subject to change without notice. Certain features may not yet be generally available. Force10 Networks, Inc. assumes no responsibility for any errors that may appear in this document. WP22 508 v1.5