Monitoring Guidelines for Microsoft.com and Update.Microsoft.com

Published: June 2010

Overview: This document shares experiences in monitoring and diagnostics from the Operations team running Microsoft.com and Update.Microsoft.com. This document provides information on the monitoring tools available for all applications hosted in MSCOM data centers and is intended to serve as a guideline for project teams (project managers, developers, test, and operations) during various stages of the project to assist those teams in ensuring that monitoring is fully covered as part of the final deployment.

Author: Brian Copps

© 2010 Microsoft Corporation. All rights reserved. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, System Center, Azure, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Table of Contents

SUMMARY
OVERVIEW AND BACKGROUND
SYSTEM CENTER OPERATIONS MANAGER: HOW IT WORKS
    Event Monitors
    Performance Threshold Monitoring
    User Perspective Monitoring
    Alerts
HOW ERRORS ARE TRACKED
HOW OPERATIONS USES THESE TOOLS
USER PERSPECTIVE MONITORING WITH SYSTEM CENTER OPERATIONS MANAGER
EVENT MONITORS AND SYSTEM CENTER OPERATIONS MANAGER
MONITORING BEST PRACTICES
FOR MORE INFORMATION
Summary

The goal of application monitoring is to answer a simple operational question: Is the application running efficiently, within its defined SLA parameters, and without errors? If the answer is no, the operational support team needs to be made aware of the condition as soon as possible. Effective placement of monitoring on critical application and system breakpoints helps manage the hosted solutions.

Overview and Background

This monitoring guide provides information on the monitoring tools available for all applications hosted in MSCOM data centers. It serves as a guideline for project teams (project managers, developers, test, and operations) during the various stages of a project, to help those teams ensure that monitoring is fully covered as part of the final deployment. Project teams are encouraged to become familiar with the information provided, outline the critical aspects of current or new projects, and work with Operations to build monitoring around that system.

This document is being published in response to frequent requests to share this information externally. Customers can leverage any or all of the best practices shared by the MSCOM team.

System Center Operations Manager: How It Works

Microsoft System Center Operations Manager 2007 R2 provides end-to-end monitoring for the enterprise IT environment. System Center Operations Manager can monitor thousands of servers, applications, and clients and provides comprehensive views of their health states. These views are key to a rapid and agile response to events that can affect the availability of the services the IT department provides to its customers.

Event Monitors

Operations Manager captures a wide variety of system and application events from Windows-based systems distributed throughout an enterprise information technology (IT) environment and aggregates them into a central event repository.
Administrators can consolidate these events for an overall view of server and service availability, or they can obtain specific information from the detailed event stream, all from a single view on the desktop.

Performance Threshold Monitoring

System Center Operations Manager can be set to monitor key performance thresholds. Rules can be customized and new rules added, allowing system and application performance trends to be monitored for both historical reporting and capacity planning. In addition, local and aggregated thresholds can be set to generate alerts and actions in response to any changes in system or application performance that require administrative intervention.

User Perspective Monitoring

System Center Operations Manager allows administrators to create synthetic transactions that act like a user of the service and report back the success or failure of each execution, along with its performance statistics. The synthetic transaction results can be used for reporting or as an alert to possible service problems. This feature tests a server at the application layer to determine whether the server is available. It is used to monitor valid URLs and to look for specific HTTP status codes or timeouts on URL execution. It can generate transaction-based tests (tests that would require user input for successful execution) and can monitor web service Ping levels as well as normal HTTP response codes.

Alerts

Any System Center Operations Manager rule can be configured to generate specific alerts with associated severity levels.

How Errors Are Tracked

Any alert raised is collected into a central repository and sent to the tier 1 engineering team (24/7) for assessment. From that point, the team will resolve the problem following Technology Support Group (TSG) instructions or escalate the matter to the on-call engineer who owns that infrastructure.

Performance data is collected every 90 seconds against a master list of performance monitor objects and counters. The data is stored in a central repository for real-time monitoring views and for historical trending per server, cluster, and property.

How Operations Uses These Tools

Operations works closely with teams to ensure that their applications are tested according to site availability requirements.
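The synthetic transactions described under User Perspective Monitoring can be pictured as a small probe that requests a URL the way a user would and records the status code, the elapsed time, and whether the request timed out. The following is a minimal Python sketch of that idea, not the way System Center Operations Manager implements it. The names `ProbeResult`, `is_healthy`, and `probe`, along with the default timeout and response-time budget, are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    url: str
    status: Optional[int]   # None when no HTTP response arrived (timeout, refused, DNS)
    detail: str             # HTTP reason phrase or the transport error text
    elapsed_seconds: float
    healthy: bool


def is_healthy(status: Optional[int], elapsed_seconds: float, max_seconds: float) -> bool:
    """A probe passes only if it got HTTP 200 within the response-time budget."""
    return status == 200 and elapsed_seconds <= max_seconds


def probe(url: str, timeout: float = 10.0, max_seconds: float = 5.0) -> ProbeResult:
    """Request `url` once and report its status code and timing."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status, detail = response.status, response.reason
    except urllib.error.HTTPError as exc:
        # Error responses (4xx, 5xx, and custom codes such as 600) arrive here.
        status, detail = exc.code, str(exc.reason)
    except OSError as exc:
        # Timeouts, refused connections, and DNS failures: no HTTP status at all.
        status, detail = None, str(exc)
    elapsed = time.monotonic() - start
    return ProbeResult(url, status, detail, elapsed, is_healthy(status, elapsed, max_seconds))
```

A caller might run `probe("http://localhost:8080/monitor")` (a hypothetical URL) on a schedule and raise an alert whenever `healthy` is false, using `status` and `detail` to decide which documented action applies.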
Because resources do not allow a test for every page or function in a given application, Operations cannot guarantee that monitoring will cover all aspects of the application. Operations can, however, modify or add tests if gaps are found. Your systems engineer will help you determine the best tools to use for your application.

User Perspective Monitoring with System Center Operations Manager

Whenever possible, application developers should develop a test page (or pages) that tests the core functionality of the application as part of the normal software development cycle. System Center Operations Manager calls this test page, and the status code returned with the result determines the primary health of the application. This page must be able to initialize any code and touch back-end dependencies.

The test page must:
- Return an HTTP 200 status code on success.
- Return a distinguishable HTTP status code on error conditions (that is, a code greater than 599).
- Return an HTTP result description (not necessarily the text on the page) that describes the error.
- Be relatively lightweight (especially if you plan to test it frequently).

Event Monitors and System Center Operations Manager

As part of managing exceptions, an application should write exceptions to the Windows application event log. At a high level, classify the events written to the application event log into three large buckets:
- Error: application events that require some form of immediate action by Operations.
- Warning: application events that do not require immediate action but need to be fed back into the development cycle for potential code changes or longer-term investigation by Operations.
- Information: application events that are useful during development, testing, or debugging sessions. An application configuration file setting must control generation of these events. The Microsoft.com Operations team will disable these events via the configuration file setting when Release Management deploys the application to the production environment.

Each event log entry must:
- Use a unique value for SourceName.
- Use a unique event ID.
- Be less than 4 KB in size and include:
    - Module, assembly, class, and method name, as appropriate.
    - The error message.
    - Whether a message or file was involved, including its identity.
    - If appropriate, a referring URL or some method of identifying who or what called the service.
    - Sufficient information to help Operations determine what the problem is and/or what caused it.
- Use one event ID as a catch-all for unexpected errors.
- Do not write events to support metrics or statistics generation.
- Do not include passwords in any error description.

No application should regularly spam the application event log with a flood of events. The Microsoft.com Operations team considers applications that generate more than 250 events per server per day to be spammers and may disable monitoring of the application event log until the project team addresses the problem.

Application developers must document a course of action for any event of type Error. Microsoft.com Operations considers it a best practice for application developers to document all application events of any type. The System Center Operations Manager deployment can take action on behalf of a systems engineer when it sees an error that has a simple fix, such as cycling the application pool for the application.

As part of the deployment, the Development team must provide documentation that lists the following information for each event ID assigned to their application:
- Event ID
- SourceName
- Error type (error, informational, debug mode only)
- Example event log entry
- Problem description
- Resolution steps
- Message displayed to users, if applicable. (For user interface [UI] errors, the UI may display a user-friendly error message that differs from what is written to the application event log.)

Monitoring Best Practices

Good monitoring practices come from identifying the critical features and dependencies of an application and creating monitoring hooks around them. These built-in monitoring hooks enable faster issue resolution when they are documented by project teams and configured for monitoring by Operations.

Best practices for event logging and reporting include:
- Write only actionable events to the event log. Informational or warning events should not be written to the logs as errors.
- Where additional non-actionable information needs to be collected, allow a flag to be set on demand (for example, in a configuration file) so that information or warning entries can be gathered. (This flag could be used in situations where application errors need to be traced from the time an application starts to run.)
- Ensure that the combination of event ID and event source is always unique.
- Ensure that event sources have the application name as a qualifier (event source = ApplicationName_ErrorReadData) in order to avoid conflicts with other applications on the same server that have the same feature.
- Where possible, ensure that event text is descriptive enough that appropriate action can be taken.
- Document the event IDs and event sources used for an application, and provide troubleshooting steps to resolve these errors. This is the key to monitoring and resolving issues with quick turnaround times for that application.
- In situations where an error is generated a number of times, allow the application to write only every tenth occurrence of that error (a value that can be set through a configuration file) to the application event log. Note: The first time an error occurs, it is always written to the log. Subsequent, similar errors are then counted rather than written individually. This behavior ensures that the event log does not fill up with the same error, which would result in the loss of other valuable information.
- Use unique (custom) event logs per application, where required.
- Do not log personally identifiable information (PII) to the event logs, even as errors. Where it is required, allow for a flag to be set to ensure Operations intervention.

Best practices for application monitoring pages and reporting include:
- A monitoring page should test for the success of a critical feature (or features) or dependencies and report via an HTTP status code: 200 for success, greater than 599 for an application-specific failure.
- A monitoring page can be used to create a multi-step test in which each step tests a specific piece of functionality. Each step can then return an HTTP status code of 200 upon success or a non-200 HTTP status code upon failure of that step.
- Document each of the monitoring test steps in the monitoring page, the related HTTP status codes, and the action to be taken when a specific non-200 status code is returned by the monitoring page.
- Apart from returning a non-200 status code, monitoring pages can write events to the application event log to provide more specific information about the error for further troubleshooting.
- A monitoring page representing critical features and dependencies can be used to report on the overall availability of the system, if required.

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to:

Microsoft Corporation
http://www.microsoft.com

System Center Operations Manager
http://www.microsoft.com/systemcenter/en/us/operations-manager.aspx

Initializing and Configuring Diagnostic Data Sources
http://msdn.microsoft.com/en-us/library/ee843890.aspx
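As an illustration of the multi-step monitoring page described under Monitoring Best Practices, the following Python sketch runs a list of dependency checks in order and reports 200 on success or a step-specific code greater than 599 on the first failure, with the description carried in the HTTP reason phrase. It is a sketch under stated assumptions rather than the MSCOM implementation: the step names, the codes 601 and 602, and the placeholder check functions are all hypothetical, and a real page would be written in the application's own web framework.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, List, Tuple

# (step name, check function, status code to return if the step fails)
Step = Tuple[str, Callable[[], bool], int]


def check_database() -> bool:
    """Placeholder: a real check would run a cheap query against the back end."""
    return True


def check_cache() -> bool:
    """Placeholder: a real check would read a known key from the cache."""
    return True


# Hypothetical failure codes; each one should be documented for Operations.
STEPS: List[Step] = [
    ("database", check_database, 601),
    ("cache", check_cache, 602),
]


def run_steps(steps: List[Step]) -> Tuple[int, str]:
    """Return (200, "OK"), or the code and description of the first failing step."""
    for name, check, failure_code in steps:
        try:
            passed = check()
        except Exception as exc:
            return failure_code, f"{name} check raised {type(exc).__name__}"
        if not passed:
            return failure_code, f"{name} check failed"
    return 200, "OK"


class MonitorHandler(BaseHTTPRequestHandler):
    """Serves the monitoring page; the status line carries the test result."""

    def do_GET(self) -> None:
        code, description = run_steps(STEPS)
        body = description.encode("utf-8")
        # The description is sent as the HTTP reason phrase and as the body.
        self.send_response(code, description)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler can be served with `http.server.HTTPServer(("127.0.0.1", 8080), MonitorHandler).serve_forever()`. Keeping each check cheap matters because the page may be requested frequently.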