NETAPP WHITE PAPER
Improving the Customer Support Experience with NetApp Remote Support Agent
Ka Wai Leung, NetApp
April 2008
WP-7038-0408
TABLE OF CONTENTS
1 INTRODUCTION
2 NETAPP REMOTE SUPPORT AGENT OVERVIEW
2.1 FEATURES
2.2 REMOTE SUPPORT AGENT COMPONENTS
2.3 NETAPP SUPPORT INTERACTION
3 ARCHITECTURE
3.1 REMOTE SUPPORT ENTERPRISE
3.2 REMOTE SUPPORT AGENT
3.3 DATA ONTAP INTERFACE
3.4 AUTO CORE UPLOAD
3.5 SECURITY
4 SUMMARY
1 INTRODUCTION

Storage is a mission-critical component of any IT infrastructure, and customers expect high service levels from their storage vendors. NetApp Support is responding to this challenge through support offerings that are built around people, process, and technology. Remote support automation is one key technology component that helps to reduce case resolution time and manual overhead for customers.

Remote monitoring and data collection are essential components of a remote support offering. NetApp Support statistics show a probability of greater than 50% of closing a case within one day (likely within the first one to two hours) if a support engineer has access to the required diagnostic data. Without access to proper diagnostic data (e.g., log files, perfstat, or AutoSupport), the probability of closing a case within one day drops to less than 35%.

[Figure: Downtime is expensive. Lost revenue in millions of dollars per hour by industry. Industries shown: Retail, Insurance, Information technology, Financial institutions, Manufacturing, Call location, Telecommunications, Credit card services, Energy, Retail brokerage. Values shown ($M/hour): 1.1, 1.2, 1.3, 1.5, 1.6, 1.6, 2, 2.6, 2.8, 6.5.]

NetApp customers, especially those with SupportEdge Premium contracts, expect minimal interaction with NetApp Support during case handling. Customers expect a high degree of remote support automation so they can minimize the amount of time they spend on the phone with Support.

Data collection is a time-consuming task for Support: an engineer must reach the customer, explain which data to collect and how to gather and send it, and then wait for the data to arrive. This process is often repeated multiple times during the case triage cycle.

Core file handling can be another laborious customer task because customers must manually upload the files. For some NetApp controllers, a core file can be multiple gigabytes in size. Uploading such a file over a slow or unstable Internet link is cumbersome and can fail partway through.
2 NETAPP REMOTE SUPPORT AGENT OVERVIEW

2.1 FEATURES

The NetApp Remote Support Agent (RSA) is designed to be fast, simple, and reliable. It supports three major features:

Remote data collection: The RSA allows NetApp Support to remotely collect selected files from the /etc/log and /etc/crash directories. It also allows Support to remotely trigger an AutoSupport event on the customer's NetApp controller and return a complete AutoSupport log.

Intelligent core handling: When the NetApp system panics or reboots, the RSA uploads the core file to NetApp without customer intervention. A core file is uploaded only if it is not corrupted and its panic signature does not match any known condition in NetApp's panic message database.

Down appliance notification: The RSA detects when a NetApp controller is down (e.g., not serving data) and automatically opens a case with NetApp Support on the customer's behalf.
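The upload rule for intelligent core handling can be sketched in a few lines. This is a minimal illustration under stated assumptions, not NetApp's implementation: the CoreFile type, the should_upload_core function, and the example signature set are hypothetical names introduced here, and the real panic message database and its matching rules are not described in this paper.

```python
from dataclasses import dataclass
from typing import AbstractSet

# Hypothetical stand-in for the panic message database kept at NetApp.
KNOWN_PANIC_SIGNATURES = frozenset({"example-known-panic-signature"})


@dataclass
class CoreFile:
    """Illustrative record for a core file found on the controller."""
    path: str
    panic_signature: str
    is_corrupted: bool


def should_upload_core(core: CoreFile, known_signatures: AbstractSet[str]) -> bool:
    """Upload only intact cores whose panic signature is not already known."""
    if core.is_corrupted:
        return False
    return core.panic_signature not in known_signatures
```

Under this rule, a corrupted core or a core that matches a known condition is skipped, and only a core with an unrecognized signature is sent for analysis.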
2.2 REMOTE SUPPORT AGENT COMPONENTS

The RSA has two major components:

1. Embedded agent within the NetApp controllers (Remote Support Agent)
The RSA will be incorporated into the existing Remote LAN Management (RLM) card. The RLM is a service processor supported on the FAS and V 3000 and 6000 series. The RSA will be delivered as a firmware upgrade to the RLM. Customers will have the option of enabling or disabling the RSA during the RLM configuration process. Firmware updates of the RLM can occur without powering down the NetApp controller or requiring any Data ONTAP changes.

2. Back-end infrastructure residing at NetApp (Remote Support Enterprise)
The RSA communicates with the Remote Support Enterprise (RSE), a back-end infrastructure residing within NetApp. Through the RSE user interface, a NetApp support engineer can issue data collection commands to the RSA and view the information returned. The RSE also provides a data store in which the collected data is retained for 30 days before being deleted.

[Figure: RSA deployment overview. Diagram labels: Customer, RSA, Firewall, Internet (HTTPS), Firewall, NetApp Support, Customer Data Repository. Callouts: resides on RLM; secured access model; nondisruptive upgrade; functional even when appliance is down; appliance down notification; optimized core handling; remote data collection.]
2.3 NETAPP SUPPORT INTERACTION

NetApp Support interacts with the RSA via a simple user interface on the RSE. The RSA is tightly integrated with NetApp's case flow process. During case triage, the case information screen highlights the presence of the RSA to the NetApp support engineer. The support engineer can then access the RSE to perform remote data collection or look at data already collected. The RSE user interface is behind NetApp's firewall and can be accessed only by NetApp Support personnel.

The UI fits on a single screen and contains the following sections:

RSA configuration summary: Shows the status of the agent, the firmware level, and the contact frequency with the NetApp controller.

NetApp controller summary: Shows model information, the status of the controller, and whether AutoSupport is enabled.

Recent activities: Displays all recent activities, alarms, and events from the RSA.

Remote data collection: Enables the support engineer to remotely trigger a complete AutoSupport upload to NetApp, collect a file listing from the /etc/crash and /etc/log directories, and upload specific files from either directory.

Uploaded files and status: Displays a list of files that were recently uploaded or are being uploaded, along with their status.
3 ARCHITECTURE

The RSA architecture consists of an embedded agent within the NetApp system owned by the customer and the RSE, which is hosted at NetApp. The RSA, running on the RLM service processor, initiates a secure, authenticated connection to the RSE and provides NetApp Support with remote access to logs, core files, and other diagnostic information. The RSA architecture also gives customers complete control to audit RSA usage and to disable all features via a single CLI command.

3.1 REMOTE SUPPORT ENTERPRISE

The RSE is a Web application that communicates with the RSA. The RSE provides NetApp Support with a user interface for device interaction and monitoring. The RSE application runs on clustered hardware to provide high availability and performance. The RSE uses an Oracle back-end database, with both the database and the file systems residing on NetApp storage systems. The RSE infrastructure is tightly integrated with NetApp Support's SAP CRM tools to provide a seamless interface for automatic case generation and case handling workflow integration.

The RSE continuously monitors the heartbeat of the RSA. The RSE dynamically sets the heartbeat interval based on these scenarios:

First-time device registration: If the NetApp system has a SupportEdge Premium entitlement, the heartbeat interval is set to five minutes. If the system does not have a SupportEdge Premium entitlement, the interval is set to 24 hours.

Case creation: When specific types of cases are created for the NetApp system, the heartbeat interval is shortened to 10 seconds for 24 hours, after which it returns to the default 5-minute rate.

RSE UI access by NetApp Support: The heartbeat interval is automatically shortened to 10 seconds when NetApp Support accesses the RSE UI to perform remote data collection. This allows NetApp Support to interact quickly with the NetApp system during case triage.
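The three heartbeat scenarios above amount to a small selection policy, sketched below. This is an illustration only: the function name, parameters, and constants are hypothetical, and the sketch assumes that a system without SupportEdge Premium returns to its own 24-hour interval after a case-driven boost, which the text does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import Optional

PREMIUM_INTERVAL = timedelta(minutes=5)    # SupportEdge Premium entitlement
STANDARD_INTERVAL = timedelta(hours=24)    # no Premium entitlement
FAST_INTERVAL = timedelta(seconds=10)      # case triage or active UI session
CASE_BOOST_DURATION = timedelta(hours=24)  # how long a new case keeps the fast rate


def heartbeat_interval(has_premium: bool,
                       now: datetime,
                       case_opened_at: Optional[datetime] = None,
                       ui_session_active: bool = False) -> timedelta:
    """Pick the heartbeat interval for one system (illustrative policy)."""
    # A support engineer working in the RSE UI always gets the fast rate.
    if ui_session_active:
        return FAST_INTERVAL
    # A recently created case keeps the fast rate for 24 hours.
    if case_opened_at is not None:
        age = now - case_opened_at
        if timedelta(0) <= age < CASE_BOOST_DURATION:
            return FAST_INTERVAL
    # Otherwise fall back to the rate set at registration (assumed behavior
    # for systems without Premium entitlement).
    return PREMIUM_INTERVAL if has_premium else STANDARD_INTERVAL
```

For example, a Premium system with a case opened one hour ago would be checked every 10 seconds, and the same system would drop back to five minutes once the case is more than 24 hours old.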
3.2 REMOTE SUPPORT AGENT

The RSA is delivered to customers as a firmware extension of the RLM service processor. The RSA code is compact and resides within the RLM's flash memory space.

RLM Overview

The RLM is a service processor integrated into the FAS and V 3000 and 6000 series. It continuously monitors the storage controller and provides platform management capabilities, including console redirection, monitoring, logging, and alerting. Upon detection of a down-appliance event at the controller, the RLM triggers an AutoSupport notification to NetApp Support.

The RLM is implemented as a small form-factor embedded microcomputer with a dedicated Ethernet interface. It is powered by standby voltage, which is available as long as at least one of the system's power supplies has input power. The RLM has its own embedded operating system that runs independently of Data ONTAP. These features allow the RLM to stay operational regardless of the operating state of the storage controller. This architecture also allows the RLM firmware to be updated asynchronously from Data ONTAP and in a nondisruptive manner.

Remote Support Agent

The RSA extends the current feature set of the RLM to include remote support automation. Like the RLM, the RSA is supported only on FAS and V 3000 and 6000 systems.

The RLM has its own Ethernet port. The RSA uses this port to provide an outbound, authenticated communication channel from the customer location to NetApp over HTTPS. The RSA periodically connects to the RSE at NetApp and notifies NetApp of events of interest, such as a change in the state of a monitored system. It also downloads and processes commands from NetApp to upload log files and core files or to trigger a Data ONTAP AutoSupport.

The RSA constantly monitors its own configuration by running health checks. This health monitor sends an alert to the RSE if errors are detected in the configuration (e.g., user ID or password errors, or a connection failure between the RSA and the NetApp controller). The RSA sends each alert only once, when the error is first detected; it does not repeat alerts.

The RSA uses very little network bandwidth. Heartbeats sent to the RSE every five minutes are each less than a kilobyte in size. Unlike other remote monitoring architectures, the RSA does not collect data at regular intervals; it collects data only on demand (e.g., when requested by NetApp Support in the context of case handling). The log and core files it collects are compressed before they are sent to NetApp.

3.3 DATA ONTAP INTERFACE

The RSA provides log and core file collection from Data ONTAP. The RSA does not have direct hardware access to this data, but uses various interfaces to proxy its requests to Data ONTAP. The RSA only collects data from the /etc/log and /etc/crash directories, which reside on the root volume of Data ONTAP.
RSA access to these directories is through FilerView, which is part of Data ONTAP.

NetApp Support can also use the RSA to trigger an on-demand AutoSupport on the NetApp system. Having the most recent AutoSupport data helps Support improve case resolution time. To trigger an AutoSupport, the RSA uses telnet or rsh to issue the command over the local customer network to the NetApp system. The AutoSupport triggered by the RSA is the standard Data ONTAP AutoSupport and uses the standard AutoSupport settings. This AutoSupport message is uniquely identified by a Subject line of "Remote Support Agent triggered ASUP".

3.4 AUTO CORE UPLOAD

Even with compression, core files on some NetApp systems can range from 4GB to more than 20GB in size. Core file handling can be a cumbersome task when the customer has to copy the core file from the NetApp system to an intermediate host and then transfer it to NetApp. Sending a large core file over the Internet can take many hours. To minimize transmission issues over unreliable networks, some customers in the past had to manually break large core files into multiple pieces before sending them to NetApp. NetApp, in turn, had to reassemble these pieces into a contiguous file once received.

To ease core file handling, the RSA automates the entire core upload process. The RSA is designed to provide reliable and resilient upload of core files, including mechanisms to handle upload failures.

When a NetApp system panics, it dumps the memory content to disk and reboots. Data ONTAP then collects this memory content and saves it into a core file. Upon reboot, Data ONTAP sends an RPANIC AutoSupport message to NetApp that contains the panic message and the related backtrace. This RPANIC AutoSupport message is parsed by the Panic Message and Backtrace Analyzer (PMBTA) at NetApp. The PMBTA tool analyzes the core signature and determines whether it is a known issue. If a match is found, uploading the core file is not required. If no match is found, the core file is uploaded to NetApp Support for analysis.

The RSE automates core upload by instructing the RSA to upload the core file to NetApp. To ensure data integrity, the core file carries a checksum, and the RSA retries the upload when it encounters network failures. After the core file has reached the RSE in its entirety, the core is moved to NetApp Support for processing and the core record in the corresponding support case is updated.

In addition to automated core upload after a panic, the RSA allows NetApp Support to retrieve core files on demand from a list of all core files available on a NetApp system.

Auto core upload minimizes manual overhead for customers. It also enables NetApp Support to begin diagnostics much sooner than if the core were uploaded manually, because delays such as contacting the customer and waiting for the upload are eliminated.

3.5 SECURITY

NetApp has taken an end-to-end approach in providing a secure, trusted, and verifiable infrastructure for the RSA. The customer has full control and visibility over all remote access events and activities.
The customer can disable the connection to NetApp and all RSA features via a single CLI command.

3.5.1 Agent Security

The RSA always initiates the connection to the RSE, never vice versa, over an outbound-only Internet connection to NetApp. The RSA periodically connects to the RSE, downloads any action requests, and deposits the system status and the results of previous requests. The RSA architecture does not allow NetApp to dial into a customer system. This is similar to how a Web browser opens a connection to a Web site and not the other way around.

Communication between the RSA and NetApp is authenticated and encrypted using 128-bit SSL with VeriSign certificates. The RSA retains a copy of the NetApp public certificate to ensure that communication occurs only with NetApp. If authentication fails, the connection is broken and no data is sent.

The RSA has no direct access to the root volume on the NetApp controller. It uses FilerView's HTTPS interface to request diagnostic data and core files from Data ONTAP. FilerView is hardcoded to allow read access only to the "/etc/log" and "/etc/crash" directories on the root volume. Read access to other directories is prohibited. The RSA also checks every file upload request originating from the RSE and rejects any request for a file outside these two directories.
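The request check described above, combined with the checksum and retry behavior from section 3.4, can be sketched as follows. This is a minimal illustration, not the RSA firmware: the function names, the send callable, the retry count of three, and the choice of SHA-256 are all assumptions made for the example, since the paper states only that a checksum and retries are used.

```python
import hashlib
import posixpath
from typing import Callable

ALLOWED_DIRS = ("/etc/log", "/etc/crash")
MAX_ATTEMPTS = 3  # assumed; the paper does not give a retry count


def is_allowed_path(requested: str) -> bool:
    """True only for absolute paths that resolve inside an allowed directory."""
    if not requested.startswith("/"):
        return False
    # Collapse "." and ".." so "/etc/log/../passwd" cannot slip through.
    normalized = posixpath.normpath(requested)
    return any(normalized.startswith(d + "/") for d in ALLOWED_DIRS)


def upload_with_retries(path: str,
                        data: bytes,
                        send: Callable[[str, bytes, str], None],
                        max_attempts: int = MAX_ATTEMPTS) -> str:
    """Validate the request, then send data with a checksum, retrying on failure."""
    if not is_allowed_path(path):
        raise PermissionError(f"upload request outside allowed directories: {path}")
    # SHA-256 is an assumption; the receiver would recompute and compare it.
    checksum = hashlib.sha256(data).hexdigest()
    last_error = None
    for _ in range(max_attempts):
        try:
            send(path, data, checksum)  # hypothetical transport hook
            return checksum
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError(f"upload failed after {max_attempts} attempts") from last_error
```

Normalizing the path before comparing it against the two directories is a design choice of this sketch; it keeps a request such as /etc/log/../passwd from being treated as a file under /etc/log.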
3.5.2 Remote Support Enterprise Security

Access to the RSE is restricted to authorized NetApp Support personnel. All NetApp Support interactions with the RSE are recorded and can be audited by the customer through NetApp's NOW site (see the screen example below). Logs and message files uploaded to the RSE are kept for 60 days and then erased.

3.5.3 Symantec Security Audit

NetApp has partnered with Symantec to assess the security of the RSA and to ensure that it conforms to industry best practices. Symantec was chosen because of its leadership position in the security solution space. After a detailed analysis of the RSA, Symantec concluded that it provides a secure framework for data protection and meets industry security best practices. For more information, see the Network Appliance Remote Support Agent Security Assessment white paper by Symantec.

To ensure continued adherence to security best practices, NetApp intends to contract with Symantec to perform a security assessment for each update of the RSA.
4 SUMMARY

NetApp Support statistics show a greater than 50% probability of closing a case within one day (likely within the first one to two hours) if Support has immediate access to appropriate diagnostic data. The RSA is designed to automate this data collection for NetApp Support. By deploying the RSA, customers will see improved case resolution time and system availability while minimizing their interaction with Support over the phone. The Remote Support Agent is a critical component of NetApp's support automation strategy.

www.netapp.com

© 2008 NetApp. All rights reserved. Specifications are subject to change without notice. NetApp, the NetApp logo, Go further, faster, NOW, and Data ONTAP are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. Oracle is a registered trademark of Oracle Corporation. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.