Monitoring and Diagnostic Guidance for Windows Azure hosted Applications

Published: June 2010

Overview: This monitoring and diagnostics guide provides information on the tools available for applications hosted on the Windows Azure platform. It serves as a guideline for project teams (project managers, developers, Test, and Operations) and helps those teams ensure that monitoring and diagnostics are fully covered as part of the final deployment. Project teams are encouraged to become familiar with the information provided, outline critical aspects of current or new projects, and work with Operations to build monitoring around that system.

Author: Brian Copps

© 2010 Microsoft Corporation. All rights reserved. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Azure, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Table of Contents

Summary
  Solution Overview
  How Errors Are Tracked
  How Operations Uses These Tools
Monitoring Options
  Event Monitors
    Troubleshooting Documentation
  User Perspective Monitoring
  Performance Monitoring
Windows Azure Developer and Project Guidance
  Diagnostic API (Diagnostic Data Collection)
  Using the TraceSource Class for Trace (Event) Logging
  Azure Storage Guidance and Policies
Appendix
  Enable Windows Azure Diagnostics Data Sources
  Using the TraceSource to Log Events
    Configuration file sections (configure the Windows Azure trace listener for this specific trace source)
    C# code (use the TraceSource to log events)
    Web.Config Setting for Failed Request Tracing
For More Information

Summary

The goal of application monitoring is to operationally answer a simple question: Is the application running efficiently within its defined SLA parameters and without errors? If the answer to this question is no, then the Operations team needs to be made aware of this condition as soon as possible. Effective placement of monitoring on critical application and system breakpoints will help manage the hosted solutions.

This document is intended for development and operations teams that need to monitor applications hosted on the Windows Azure Services Platform, enabling incident, problem, and knowledge management.

Solution Overview

This monitoring and diagnostics guide provides information on the tools available for applications hosted on the Windows Azure platform. It serves as a guideline for project teams (project managers, developers, Test, and Operations) and helps those teams ensure that monitoring and diagnostics are fully covered as part of the final deployment. Project teams are encouraged to become familiar with the information provided, outline critical aspects of current or new projects, and work with Operations to build monitoring around that system.

How Errors Are Tracked

Any alert thrown is collected into a central repository and pumped to the Tier 1 engineering team (24/7) for assessment. From that point, the team will resolve the problem following Technology Support Group (TSG) instructions or escalate the matter to the Tier 2 or Tier 3 engineer who owns that infrastructure.

How Operations Uses These Tools

Operations works closely with teams to ensure that their applications are tested according to site availability requirements. Because resources do not allow a test for every page or function in a given application, Operations cannot guarantee that the monitoring will cover all aspects of the application. However, Operations can modify or add tests if gaps are found. Your systems engineer will help you determine the best tools to use for your application.

This document is being published in response to frequent requests to share this information externally. Customers can leverage any or all of the best practices shared by the MSCOM team.

Monitoring Options

Windows Azure is a cloud services operating system that serves as the development, service hosting, and service management environment for the Windows Azure platform. Windows Azure provides developers with on-demand compute and storage to host, scale, and manage web applications on the internet through Microsoft datacenters.

Microsoft System Center Operations Manager 2007 R2 provides end-to-end monitoring for the enterprise IT environment. System Center Operations Manager can monitor thousands of servers, applications, and clients and provides comprehensive views of their health states. These views are key to a rapid and agile response to events that can impact the availability of services the IT department provides its customers.

Event Monitors

Based on the Windows Azure platform offerings and the security context used to run applications, two sources are used for events that in turn generate alerts for the 24/7 help desk:

- Windows event logging to the application log
- Microsoft .NET TraceSource event logging

Note: At this time, the Source field cannot be used to make events unique (along with the event ID) or to identify the source of the event. The Microsoft .NET TraceSource does not include the concept of a Source field, and for the application event log, custom sources cannot be created because applications lack the administrative privileges required to register a new source. For now, the event ID must uniquely identify an application or component, and a mapping must be provided to Operations. As a best practice, include the name of the application or component in the body of the message.

Moving forward, for applications developed specifically for Windows Azure, Microsoft .NET TraceSource event logging should be used to leverage the following Windows Azure fabric provisions:

- Development fabric trace integration for developers
- A deeper level of event verbosity control for events (see TraceEventType Enumeration in the Microsoft .NET Framework Class Library)

For older applications already using the application event log, monitoring is still possible if the event IDs do not overlap. Use the TraceSource class for trace logging, because it supports inserting distinct event IDs. For more information, see the section Using the TraceSource to Log Events.

The information that the application logs is used to trigger actionable alerts for Operations. In this context, actionable means that Operations has enough information in the alert and/or the TSG guide to solve the incident.

The event ID is used by event-based monitors. Product groups document which event IDs the monitors use to enable filtering, so that only events used by the monitors defined in the Management Pack are inserted into Microsoft System Center Operations Manager 2007. The development team should provide a list of event IDs to be monitored so that the IDs can be configured for alerting. Operations engineers should work with the Management Platform & Services Delivery (MPSD) Tools and Monitoring team to configure the monitors.

At a high level, use three large buckets to classify application events in the application event log:

- Log using the Error type any application events that require immediate action by Operations.
- Log using the Warning type any application events that are actionable or need attention but do not require immediate action by Operations.
- By default, application events with a severity type lower than Warning will not be persisted to storage. (The Windows Azure diagnostics agent filter verbosity level is set for Warning.) As a debug procedure, the Windows Azure diagnostics agent filter verbosity level can be temporarily changed to a lower severity type, enabling storage of events with lower severity types.

Each event logged must:

- Use a unique value for Event ID.
- Be less than 4 KB in size and include:
  - Module, assembly, class, and method name, as appropriate.
  - The error message.
  - Whether a message or file was involved, including its identity.
  - If appropriate, a referring URL or some method of identifying who or what called the service.

  - Sufficient information to help Operations determine what the problem is and/or what caused it.

In addition, reserve one event ID as a catch-all for unexpected or unhandled errors.

Events logged must not:

- Write events to support metrics or statistics generation.
- Include passwords in any error description.
- Include personally identifiable information (PII).
- Spam the application event log with a flood of events.

Note: The MPSD Tools and Monitoring team considers applications that generate more than 250 events per server per day to be spammers and may disable monitoring of the application until the project team addresses the problem.

Troubleshooting Documentation

Application developers must document a course of action for any event of type Error. Operations considers it a best practice if an application developer documents all application events of any type. As part of the deployment, the development team must provide documentation that includes the following information for each event ID assigned to their application:

- Event ID
- Source (application, class, and so on)
- Error type (error, informational, debug mode only)
- Example event log entry
- Problem description
- Resolution steps
- Message displayed to users, if applicable; for user interface (UI) errors, the UI may display a user-friendly error message that will be different from what is written to the event log.

Best practices for event logging and reporting include:

- Anything informational or a warning should not be written as an error into the logs.
- Where additional non-actionable information needs to be collected, allow a flag to be set on demand (such as in a configuration file) so that information or warning entries can be gathered. This flag could be used in situations where application errors need to be traced from the time an application starts to run. For Windows Azure, use the diagnostics agent filter verbosity level to determine the event types that are persisted in table storage.

- Ensure that event IDs are unique.
- Ensure that event sources, which are inserted into the event message, have the application name as a qualifier (event source = ApplicationName_ErrorReadData) in order to enable Operations to quickly identify where the error is coming from.
- Where possible, ensure that event text is descriptive enough that appropriate action can be taken.
- Document event IDs used for an application, and provide troubleshooting steps to resolve these errors. This is the key to monitoring and resolving issues with quick turnaround times for that application.
- In situations where an error is generated a number of times, allow the application to write every tenth occurrence (a value that can be configured through a configuration file) of that error to the event log. Note: The first time an error occurs, it is always written to the log. Subsequent occurrences of the same error are counted rather than written individually. This behavior ensures that the event log does not fill up with the same error, which would result in the loss of other valuable information.
- Do not log PII in the event logs even as errors. Where it is required, allow for a flag to be set to ensure Operations intervention.

User Perspective Monitoring

System Center Operations Manager allows you to create synthetic transactions that act like a user of the service and report back with the success or failure and performance statistics of their execution. The synthetic transaction results can be used for reporting or as an alert to possible service problems. This feature tests a server at the application layer to determine whether the server is available. This tool will be used to monitor valid URLs, looking for various HTTP status codes or a timeout on the URL execution. The tool can also generate transaction-based tests (tests that would require user input for successful execution).
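The every-tenth-occurrence throttling recommended in the event logging best practices can be sketched as follows. This is a minimal illustration, not part of the original guidance: the class name ThrottledEventLogger and its members are hypothetical, and the interval is passed in by the caller on the assumption that it was read from a configuration file.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: writes the first occurrence of each event ID,
// then only every Nth occurrence after that.
public sealed class ThrottledEventLogger
{
    private readonly TraceSource source;
    private readonly int interval;
    private readonly Dictionary<int, long> counts = new Dictionary<int, long>();
    private readonly object gate = new object();

    public ThrottledEventLogger(TraceSource source, int interval)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (interval < 1) throw new ArgumentOutOfRangeException("interval");
        this.source = source;
        this.interval = interval;
    }

    // Returns true when the event was passed to the trace source.
    public bool Write(TraceEventType type, int eventId, string message)
    {
        long occurrence;
        lock (gate)
        {
            long previous;
            counts.TryGetValue(eventId, out previous); // previous stays 0 for a new ID
            occurrence = previous + 1;
            counts[eventId] = occurrence;
        }

        bool shouldWrite = occurrence == 1 || occurrence % interval == 0;
        if (shouldWrite)
        {
            // Include the running count so Operations can see how often the error recurs.
            source.TraceEvent(type, eventId,
                string.Format("{0} (occurrence {1})", message, occurrence));
        }
        return shouldWrite;
    }
}
```

With an interval of 10, occurrences 1, 10, 20, and so on are written for each event ID; all others are only counted. The counts are kept in memory per process, so they reset when the role instance restarts.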
Whenever possible, application developers should develop a test page (or pages) that test the core functionality of the application during the normal software development cycle. System Center Operations Manager calls this test page. The status code returned with the result of the test page determines the primary health of the application. This page must be able to initialize any code and touch back-end dependencies.

The test page must:

- Return an HTTP 200 message on success.
- Return an HTTP code that can be distinguished on error conditions (that is, greater than 599).

- Return the HTTP result description (not the text on the page, necessarily) describing the error.
- Be relatively lightweight (especially if you plan on testing it frequently).

Best practices for application monitoring pages include:

- A monitoring page should test for the success of a critical feature (or features) or dependencies and report via an HTTP status code: 200 = Success, >599 for an application-specific failure.
- A monitoring page can be used to create a multi-step test in which each step can test a specific piece of functionality. Each step can then return an HTTP status code of 200 upon success or an HTTP status code greater than 599 upon failure of that step.
- Document each monitoring test step in the monitoring page, the related HTTP status codes, and the action to be taken when a specific non-200 status code is returned to the monitoring pages.
- Apart from returning a non-200 status code, monitoring pages can be used to write events into the application event log to provide more specific information about the error that can be used for further troubleshooting purposes.
- A monitoring page representing critical features and dependencies can be used to report on the overall availability of the system, if required.

Performance Monitoring

This feature is currently not available. Project teams will be notified when such monitoring is available.
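The multi-step monitoring page described under User Perspective Monitoring can be sketched as a small step runner. This is an illustration under stated assumptions: the type names (MonitoringStep, MonitoringResult, MonitoringPage) and the specific failure codes are hypothetical, and the sketch deliberately leaves out the web framework so that it stays self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical description of one test step on a monitoring page.
public sealed class MonitoringStep
{
    public string Name;
    public int FailureStatusCode; // application-specific, must be greater than 599
    public Func<bool> Check;      // returns true when the feature or dependency is healthy
}

public sealed class MonitoringResult
{
    public int StatusCode;
    public string Description;
}

public static class MonitoringPage
{
    // Runs the steps in order and reports the first failure, or 200 if all pass.
    public static MonitoringResult Run(IEnumerable<MonitoringStep> steps)
    {
        foreach (MonitoringStep step in steps)
        {
            if (step.FailureStatusCode <= 599)
                throw new ArgumentException("Failure codes must be greater than 599: " + step.Name);

            bool ok;
            try { ok = step.Check(); }
            catch (Exception) { ok = false; } // an exception in a check counts as a failed step

            if (!ok)
            {
                return new MonitoringResult
                {
                    StatusCode = step.FailureStatusCode,
                    Description = step.Name + " failed"
                };
            }
        }
        return new MonitoringResult { StatusCode = 200, Description = "OK" };
    }
}
```

In an ASP.NET test page, the result would typically be copied to Response.StatusCode and Response.StatusDescription. Each failure code used this way should appear in the troubleshooting documentation together with the action Operations should take.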

Windows Azure Developer and Project Guidance

Diagnostic API (Diagnostic Data Collection)

A separate storage account will be created to store diagnostic data, ensuring that application and monitoring data is separated and can be accessed independently. The Windows Azure Web and Worker roles must be instrumented to enable collection for the Windows Azure diagnostic data sources shown in Table 1.

Table 1. Windows Azure Diagnostic Data Sources

Windows Azure logs
  Windows Azure platform setting: Enabled
  Details: Requires that the trace listener be added to web.config or application.config under <system.diagnostics>. The ScheduledTransferPeriod is set to 1 minute. The Windows Azure diagnostics agent filter verbosity level will be set for Warning (and higher).
  Stored on: WADLogsTable (table)

Windows event logs
  Windows Azure platform setting: Enabled
  Details: Events from application and system event logs. The ScheduledTransferPeriod is set to 1 minute. The Windows Azure diagnostics agent filter verbosity level will be set for Warning (and higher). See Appendix.
  Stored on: WADWindowsEventLogsTable (table)

IIS 7.0 logs
  Windows Azure platform setting: Enabled
  Details: The ScheduledTransferPeriod is set to 10 minutes.
  Stored on: wad-iis-logfiles (blob container)

IIS 7.0 failed request logs
  Windows Azure platform setting: Enabled
  Details: Enable tracing for all failed requests with status codes 400-599 under the system.webServer section of the role's web.config file. The ScheduledTransferPeriod is set to 10 minutes. See Appendix.
  Stored on: wad-iis-failedreqlogfiles (blob container)

Performance counters
  Windows Azure platform setting: Enabled
  Details: Enable logging for performance counters. Set the SampleRate and ScheduledTransferPeriod to 5 minutes. See Appendix.
  Stored on: WADPerformanceCountersTable (table)

The Windows Azure platform provides this functionality, and it does not require additional development work. For more information about the Windows Azure diagnostic data sources, see Initializing and Configuring Diagnostic Data Sources. The diagnostic information is stored on the dedicated storage account Operations Storage.

Using the TraceSource Class for Trace (Event) Logging

When using Microsoft .NET TraceSource event logging to log events, use the TraceSource class, because it supports event IDs. Using the generic Trace methods (such as Trace.Write and Trace.TraceError) causes the event ID to be 0 for all events, which goes against the guidance in the Event Monitors section. See Using the TraceSource to Log Events in the Appendix for C# code.

Azure Storage Guidance and Policies

A separate storage account is created to store diagnostic data. The diagnostic data stored in Windows Azure storage is used for monitoring as well as to create application baselines. Table 2 shows the retention policy.

Table 2. Windows Azure Storage Retention Policy

Data source                  Platform setting  Retention period  Stored on
Windows Azure logs           Enabled           1 week            Windows Azure WADLogsTable (table)
Windows event logs           Enabled           1 month           Windows Azure WADWindowsEventLogsTable (table)
IIS 7.0 logs                 Enabled           1 week            Windows Azure wad-iis-logfiles (blob container)
IIS 7.0 failed request logs  Enabled           1 week            Windows Azure wad-iis-failedreqlogfiles (blob container)
Performance counters         Enabled           1 month           Windows Azure WADPerformanceCountersTable (table)

For some data sources, the size of the data can be estimated in advance (performance counters); for other sources (Microsoft Internet Information Services [IIS] logs), it cannot.

These are the minimal retention requirements; they can be changed based on business or operational requirements. Application-specific data is not stored in the same store as the diagnostic data and is not subject to this policy.
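As an illustration of why performance counter volume can be estimated in advance, the sketch below multiplies the number of counters by the number of samples taken during the retention period. It assumes that each sample of each counter is stored as one entity per role instance, that a month is 30 days, and that every counter specifier resolves to a single counter instance; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical estimator for the number of performance counter samples
// held in storage at the end of a retention period.
public static class PerformanceCounterEstimate
{
    public static long SamplesRetained(
        int counterCount, int roleInstances, TimeSpan sampleRate, TimeSpan retention)
    {
        if (counterCount < 0) throw new ArgumentOutOfRangeException("counterCount");
        if (roleInstances < 0) throw new ArgumentOutOfRangeException("roleInstances");
        if (sampleRate <= TimeSpan.Zero) throw new ArgumentOutOfRangeException("sampleRate");
        if (retention < TimeSpan.Zero) throw new ArgumentOutOfRangeException("retention");

        // Whole samples taken for one counter during the retention period.
        long samplesPerCounter = retention.Ticks / sampleRate.Ticks;
        return samplesPerCounter * counterCount * roleInstances;
    }
}
```

For example, six counters sampled every 5 minutes for 30 days give 8,640 samples per counter, so two role instances (an assumed deployment size) would hold 103,680 samples. Wildcard specifiers that expand to several counter instances raise the count accordingly.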

Appendix

Enable Windows Azure Diagnostics Data Sources

    using System;
    using System.Collections.Generic;
    using Microsoft.WindowsAzure.Diagnostics;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WebRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            // Get the default configuration.
            DiagnosticMonitorConfiguration config =
                DiagnosticMonitor.GetDefaultInitialConfiguration();

            // Windows performance counters
            List<string> counters = new List<string>();
            counters.Add(@"\Processor(_Total)\% Processor Time");
            counters.Add(@"\Memory\Available MBytes");
            counters.Add(@"\TCPv4\Connections Established");
            counters.Add(@"\ASP.NET Applications(__Total__)\Requests/Sec");
            counters.Add(@"\Network Interface(*)\Bytes Received/sec");
            counters.Add(@"\Network Interface(*)\Bytes Sent/sec");

            foreach (string counter in counters)
            {
                PerformanceCounterConfiguration counterConfig =
                    new PerformanceCounterConfiguration();
                counterConfig.CounterSpecifier = counter;
                counterConfig.SampleRate = TimeSpan.FromMinutes(5);
                // Register each counter inside the loop, while counterConfig is in scope.
                config.PerformanceCounters.DataSources.Add(counterConfig);
            }
            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

            // Windows event logs
            config.WindowsEventLog.DataSources.Add("System!*");
            config.WindowsEventLog.DataSources.Add("Application!*");
            config.WindowsEventLog.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
            config.WindowsEventLog.ScheduledTransferLogLevelFilter = LogLevel.Warning;

            // Azure trace logs
            config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
            config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Warning;

            // Crash dumps
            CrashDumps.EnableCollection(true);

            // IIS logs
            config.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(10);

            DiagnosticMonitor.Start("DiagnosticsConnectionString", config);

            // For information on handling configuration changes
            // see the MSDN topic at http://go.microsoft.com/fwlink/?linkid=166357.
            // RoleEnvironmentChanging is the handler defined in the role class (not shown here).
            RoleEnvironment.Changing += RoleEnvironmentChanging;

            return base.OnStart();
        }
    }

Using the TraceSource to Log Events

Configuration file sections (configure the Windows Azure trace listener for this specific trace source)

    <system.diagnostics>
      <sources>
        <source name="MyTraceSource" switchName="sourceSwitch"
                switchType="System.Diagnostics.SourceSwitch">
          <listeners>
            <add type="Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitorTraceListener, Microsoft.WindowsAzure.Diagnostics, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                 name="AzureDiagnostics">
              <filter type="" />
            </add>
          </listeners>
        </source>
      </sources>
      <switches>
        <add name="sourceSwitch" value="Warning"/>
      </switches>
    </system.diagnostics>

C# code (use the TraceSource to log events)

    using System.Diagnostics;

    public static class Logging
    {
        // The name must match the <source> name in the configuration file.
        private static readonly TraceSource Ts =
            new TraceSource("MyTraceSource", SourceLevels.Warning);

        public static void Write(TraceEventType traceType, int eventId, string message)
        {
            // Log the event with its distinct event ID.
            Ts.TraceEvent(traceType, eventId, message);
        }
    }

Web.Config Setting for Failed Request Tracing

    <tracing>
      <traceFailedRequests>
        <add path="*">
          <traceAreas>
            <add provider="ASP" verbosity="Verbose" />
            <add provider="ASPNET" areas="Infrastructure,Module,Page,AppServices" verbosity="Verbose" />
            <add provider="ISAPI Extension" verbosity="Verbose" />
            <add provider="WWW Server" areas="Authentication,Security,Filter,StaticFile,CGI,Compression,Cache,RequestNotifications,Module" verbosity="Verbose" />
          </traceAreas>
          <failureDefinitions statusCodes="400-599" />
        </add>
      </traceFailedRequests>
    </tracing>
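The difference between TraceSource and the generic Trace methods with respect to event IDs can be observed with a small in-memory listener. This is a sketch for local experimentation only: RecordingListener and EventIdDemo are hypothetical names, the event IDs 1001 and 1002 are made up, and the Windows Azure trace listener is not involved.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical in-memory listener, used here only to observe event IDs.
public sealed class RecordingListener : TraceListener
{
    public readonly List<int> EventIds = new List<int>();

    public override void TraceEvent(TraceEventCache eventCache, string source,
        TraceEventType eventType, int id, string message)
    {
        EventIds.Add(id);
    }

    // Required overrides; plain Write calls carry no event ID, so nothing is recorded.
    public override void Write(string message) { }
    public override void WriteLine(string message) { }
}

public static class EventIdDemo
{
    public static List<int> Run()
    {
        RecordingListener listener = new RecordingListener();

        // TraceSource keeps the event ID supplied by the caller.
        TraceSource source = new TraceSource("MyTraceSource", SourceLevels.Warning);
        source.Listeners.Add(listener);
        source.TraceEvent(TraceEventType.Error, 1001, "Read from storage failed");

        // Below the Warning level, so the source switch filters this event out.
        source.TraceEvent(TraceEventType.Information, 1002, "Cache refreshed");

        // The generic Trace class has no event ID parameter and reports 0.
        Trace.Listeners.Add(listener);
        Trace.TraceError("Read from storage failed");
        Trace.Listeners.Remove(listener);

        return listener.EventIds;
    }
}
```

Calling EventIdDemo.Run() should return the IDs 1001 and 0, in that order: the TraceSource event keeps its ID, the Information event is filtered out, and the generic Trace call arrives with ID 0. This assumes the TRACE compilation symbol is defined, which is the default in Visual Studio and .NET SDK projects.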

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to:

- Microsoft Corporation: http://www.microsoft.com
- System Center Operations Manager: http://www.microsoft.com/systemcenter/en/us/operations-manager.aspx
- Windows Azure Platform: http://msdn.microsoft.com/en-us/library/dd163896(v=msdn.10).aspx
- Initializing and Configuring Diagnostic Data Sources: http://msdn.microsoft.com/en-us/library/ee843890.aspx