PERFORMANCE MONITORING AND ANALYSIS SERVICES - STABLE SOFTWARE

WP3

Document Filename: v1.0-.doc
Work package: WP3
Partner(s): UIBK, CYFRONET, FIRST
Lead Partner: UIBK
Document classification: PUBLIC

Abstract: This document describes the stable version of the software components of the performance monitoring and analysis services developed in WP3 of the K-WfGrid project. We describe the WP3 architecture and its main software components: GEMINI, DIPAS and DR.
Delivery Slip

             Name              Partner   Date        Signature
From         Hong-Linh Truong  UIBK      31/08/2006
Verified by  Piotr Nowakowski  CYFRONET  09/10/2006
Approved by  Steffen Unger     FIRST     10/10/2006

Document Log

Version  Date        Summary of changes   Author
0.1      20/08/2006  First version        Hong-Linh Truong, Thomas Fahringer
0.2      22/08/2006  Update all chapters  Hong-Linh Truong, Bartosz Balis
0.3      25/08/2006  Revise all chapters  Hong-Linh Truong, Thomas Fahringer
0.4      31/08/2006  Revise all chapters  Hong-Linh Truong, Thomas Fahringer
1.0      10/10/2006  QA check             Piotr Nowakowski
CONTENTS

1. INTRODUCTION
   1.1. ABBREVIATIONS AND ACRONYMS
2. FEATURES OF
3. IMPLEMENTATION STRUCTURE
   3.1. ARCHITECTURE OF
   3.2. EXAMPLE OF USE CASE
   3.3. GEMINI: GRID PERFORMANCE MONITORING AND INSTRUMENTATION
   3.4. DIPAS: DISTRIBUTED PERFORMANCE ANALYSIS SERVICE
   3.5. DR: DATA REPRESENTATIONS AND SERVICES INTERFACES
4. REFERENCES
1. INTRODUCTION

This document outlines the stable design and specification of the software components of the performance monitoring and analysis services and interfaces developed in WP3. The appendix to deliverable D2_5.1 previously described the specification of these components in detail, but that deliverable is not the final, stable design and specification, because much has changed over the course of the project. Through the software development (prototype and stable phases), the design and specification have been refined. We have identified changes needed to fit the WP3 software components to the purpose of the whole project. We have also received much feedback on the performance monitoring and analysis services and their interfaces, which has helped improve the usability of WP3 components in K-WfGrid.

This document therefore reflects the current, stable software components of WP3. On the one hand, it can be considered a supplement to WP3's D2_5.1 appendix. On the other hand, it is delivered as the document for the stable software version. For detailed software manuals for both users and developers, we provide six accompanying manuals:

- Performance Analysis Service - User Manual [DIPASUSER]
- Performance Analysis Service - Developer Manual [DIPASDEV]
- Performance Service Interfaces and Data Representation - Development Manual [DRDEV]
- Performance Service Interfaces and Data Representation - User Manual [DRUSER]
- GEMINI - User Manual [GEMINIUSER]
- GEMINI - Developer Manual [GEMINIDEV]

These six manuals are updated versions of the software prototype manuals, which were initially appendices to D2_5.2. They include the detailed manuals for the stable software version. Therefore, this document and the six above-mentioned manuals constitute the D2_5.3 deliverable (planned for month 28) for the stable version of WP3's components.

1.1. ABBREVIATIONS AND ACRONYMS

Abbreviation  Description
DIPAS         Distributed Performance Analysis Service
DR            Performance Service Interfaces and Data Representation
GOM           Grid Organizational Memory
GWES          Grid Workflow Execution Service
KAA           Knowledge Assimilation Agent
PDQS          Performance Data Query and Subscription
SIRWF         Standardized Intermediate Representation for Workflows
WARL          Workflow Analysis Request Language
WIRL          Workflow Instrumentation Request Language
WSRF          Web Services Resource Framework
2. FEATURES OF

The performance monitoring and analysis services in the K-WfGrid project aim to provide online information about the performance of Grid workflows and of the Grid resources involved in workflow execution. This performance information gives the workflow developer/user insight into the execution of workflows. It also gives the K-WfGrid middleware and services, especially the Scheduler and the KAA (Knowledge Assimilation Agent), the performance knowledge they need to construct and execute workflows semi-automatically.

The main features of the WP3 performance monitoring and analysis services are:

- Instrumentation and monitoring of Grid workflows composed of WSRF/Web Services.
- Monitoring, performance analysis and visualization of Grid sites, networks and workflows in a comprehensive and unified system.
- Performance analysis and monitoring offered as Grid services, which provide performance information for the semi-automatic construction and execution of workflows.
- Analysis of workflow overheads based on a systematic classification of performance overheads for Grid applications, and search for performance problems based on performance constraints.
- Performance, monitoring and event data represented in a well-defined XML format, which substantially simplifies interaction with other services.
- Instrumentation and measurement requests issued dynamically via an XML interface.
3. IMPLEMENTATION STRUCTURE

3.1. ARCHITECTURE OF

The performance monitoring and analysis services for Grid infrastructure and applications in K-WfGrid include three main components:

- GEMINI: Performance Monitoring and Instrumentation Service
- DIPAS: Distributed Performance Analysis Service
- DR: Performance Data Representations and Interfaces

GEMINI and DIPAS are integrated into a single environment that supports performance monitoring and analysis of Grid workflows and infrastructure. DR provides the data representations and interfaces that describe performance data, events, requests and responses exchanged among DIPAS, GEMINI and other clients of WP3 (e.g., Scheduler, KAA).

Figure 1: WP3 Performance Monitoring and Analysis Services

Figure 1 depicts the final, stable design of the WP3 components in the context of the K-WfGrid project. GEMINI is responsible for the instrumentation and monitoring of workflows invoked by GWES and of the Grid resources involved in workflow execution. Building on the features provided by GEMINI, DIPAS conducts the performance analysis and visualization of Grid workflows and infrastructure; it comprises a set of portlets/portlet services, the DIPAS Gateway and the DIPAS Portal. The user interacts with WP3 through the DIPAS Portal. External services that need performance data, such as the Scheduler and KAA, interface with GEMINI and DIPAS through well-defined service interfaces and data representations. All performance data, requests and responses are described in XML, according to a set of XML schemas in the DR component. The performance data representations and service interfaces include event descriptions and languages for specifying instrumentation requests, performance data subscription and query, performance constraints, and overhead analysis. Sections 3.3, 3.4 and 3.5 briefly describe GEMINI, DIPAS and DR, respectively.

3.2. EXAMPLE OF USE CASE
One use case of WP3 is the online performance monitoring and analysis of workflows.

Figure 2: Use case for online performance monitoring and analysis of workflows

Figure 2 shows the use case of conducting performance monitoring and analysis of an existing workflow. After connecting and logging in to the portal, the user can start the performance monitoring and analysis of workflows. First, the user requests the list of existing workflow IDs from the DIPAS Gateway. From this list, the user selects an existing workflow, either currently running or completed, and starts monitoring it. The workflow description is retrieved from GWES and the workflow graph is visualized. At the same time, GEMINI delivers workflow execution events to DIPAS, and the execution status of each workflow activity is updated in the graph whenever it changes. During the execution of the workflow, the user can conduct different types of analyses. Some analyses are conducted at the portal; the rest are conducted by the DIPAS Gateway. To fulfil the user's requests, the DIPAS Gateway interacts with the other services involved in performance monitoring and analysis (such as GEMINI, KAA, GWES and GOM), collects the relevant data, and conducts the performance analysis.

3.3. GEMINI: GRID PERFORMANCE MONITORING AND INSTRUMENTATION

Figure 3 shows the current architecture of GEMINI in the context of other system entities. Monitors and Sensors are the actual GEMINI components. Usually there is one Monitor per site in the system, and a number of underlying Sensors per Monitor. Sensors register themselves with Monitors, publishing the resources they handle together with the metrics they compute for those resources. Monitors publish the resource-metric pairs in the K-WfGrid GOM.
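The registration and publication steps just described can be sketched in a few lines of Python. This is only an illustrative model, not GEMINI code: the classes `ResourceMetric`, `Registry` (a stand-in for GOM) and `Monitor`, as well as all method names and the sample identifiers, are assumptions introduced here. The `discover` method models the lookup a client would perform against the registry.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceMetric:
    """A (resource, metric) pair for which a sensor can provide data."""
    resource: str
    metric: str


@dataclass
class Registry:
    """Hypothetical stand-in for GOM: maps resource-metric pairs to monitor names."""
    entries: dict = field(default_factory=dict)

    def publish(self, monitor_name, pair):
        # Record that the named monitor can serve this resource-metric pair.
        self.entries.setdefault(pair, set()).add(monitor_name)

    def discover(self, resource, metric):
        # Return the names of all monitors that published the requested pair.
        return sorted(self.entries.get(ResourceMetric(resource, metric), set()))


@dataclass
class Monitor:
    """Hypothetical per-site monitor that sensors register with."""
    name: str
    registry: Registry
    sensors: dict = field(default_factory=dict)

    def register_sensor(self, sensor_id, pairs):
        # Remember what the sensor handles, then publish each pair to the registry.
        self.sensors[sensor_id] = list(pairs)
        for pair in pairs:
            self.registry.publish(self.name, pair)


if __name__ == "__main__":
    gom = Registry()
    site_a = Monitor("monitor-site-a", gom)
    site_a.register_sensor("ganglia-sensor", [ResourceMetric("host1", "host.cpu.used")])
    print(gom.discover("host1", "host.cpu.used"))
```

Keeping the registry keyed by resource-metric pair means a client needs no knowledge of site layout: it asks for a pair and receives the monitors to contact.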
External clients, such as DIPAS, discover through GOM the GEMINI Monitors that provide the requested metrics for the resources of interest, and then send monitoring requests to those Monitors to obtain the monitoring data. Each Monitor exposes two Web Service interfaces:

- A monitoring interface based on the PDQS language.
- An instrumentation interface based on WIRL.

Through the monitoring interface, any client can obtain monitoring data in query or subscribe mode. In query mode, the requestor is blocked until the requested data is returned. In subscribe mode, the requestor specifies the time period during which the requested monitoring data should be delivered asynchronously; the data is returned via a non-Web-Service publish/subscribe channel based on ICE (Internet Communications Engine) technology.

Figure 3: Context Architecture of GEMINI (legend: GEMINI components, site boundaries, publish / discovery, monitoring request and data flow)

The instrumentation interface is used to control the instrumentation of applications, which is a necessary step before applications can be monitored. WIRL requests support a number of operations, of which the most important are:

- Request an intermediate representation of the monitored application, called SIRWF (Standardized Intermediate Representation for Workflows). SIRWF allows the user to identify the parts of the application to be instrumented and to specify the corresponding metrics (currently start/end events).
- Enable or disable the instrumentation for a given code region.

Currently, GEMINI provides tools to automatically instrument Java byte-code classes. The instrumentation inserts a SIRWF description and sensors into the code. Although this instrumentation is inserted statically, sensor execution is conditional and can be enabled or disabled at runtime. Consequently, the overhead of inactive instrumentation is negligible. Through adaptation of the OCM-G monitoring system, GEMINI also handles monitoring and dynamically enabled instrumentation of MPI applications written in C.

GEMINI provides two general types of sensors: autonomous and embedded. Autonomous sensors are deployed as independently running processes and are usually demand-driven, i.e. they provide data on request. Embedded sensors are probes inserted into code and usually provide data in an event-driven fashion, i.e. only when the instrumented code is invoked.
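The idea of statically inserted but conditionally executed sensors can be illustrated with a minimal sketch. GEMINI instruments Java byte-code; the Python decorator below is only an analogy, and the names `InstrumentationSwitch`, `embedded_sensor`, the region identifier and the event tuple layout are all assumptions made for this example. The probe is always present in the code, but it emits start/end events only while its code region is enabled.

```python
import time
from functools import wraps


class InstrumentationSwitch:
    """Hypothetical runtime switch holding the set of enabled code regions."""

    def __init__(self):
        self._enabled = set()

    def enable(self, region):
        self._enabled.add(region)

    def disable(self, region):
        self._enabled.discard(region)

    def is_enabled(self, region):
        return region in self._enabled


def embedded_sensor(switch, region, sink):
    """Wrap a function with a probe that records start/end events when enabled."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Inactive instrumentation costs one set lookup and nothing else.
            if not switch.is_enabled(region):
                return func(*args, **kwargs)
            sink.append((region, "start", time.time()))
            try:
                return func(*args, **kwargs)
            finally:
                # The end event is recorded even if the wrapped code raises.
                sink.append((region, "end", time.time()))
        return wrapper
    return decorator


if __name__ == "__main__":
    switch = InstrumentationSwitch()
    events = []

    @embedded_sensor(switch, "activity.compute", events)
    def compute(x):
        return x * 2

    compute(1)                      # region disabled: no events recorded
    switch.enable("activity.compute")
    compute(2)                      # region enabled: start and end recorded
    print([kind for _, kind, _ in events])
```

Because the check happens on every call, enabling or disabling a region takes effect immediately without re-instrumenting the code.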
Sensors describe monitoring data in XML, using schemas provided by DR. The stable version implements a number of sensors that support monitoring of both Grid infrastructure and workflow applications:

- Application sensors, which instrument software components. Currently the workflow enactment engine GWES is statically instrumented to obtain workflow-level events such as workflow and activity initialization, start of an activity, etc. In addition, a number of sample applications are instrumented semi-dynamically.
- Infrastructure sensors, which adapt the underlying infrastructure monitoring system Ganglia to GEMINI.
- An MPI application sensor, which adapts the OCM-G monitoring system to GEMINI.

GEMINI handles a number of non-functional issues that are important in the Grid:

- Security. The integrity and privacy of monitoring data are ensured by authentication through X.509 user certificates. Using proxies (ICE Glacier) to transport data streams minimizes the security risks of the necessary open ports.
- Firewall traversal and private networks. Resources behind firewalls or in private networks can be monitored thanks to the ICE Glacier proxy. The use of Web Service interfaces for client requests also addresses this problem.

3.4. DIPAS: DISTRIBUTED PERFORMANCE ANALYSIS SERVICE

DIPAS (Distributed Performance Analysis Service) controls the monitoring and instrumentation service, conducts the performance analysis of workflows at runtime, and provides clients with the performance metrics proposed by the metric ontology. DIPAS is a collection of performance monitoring and analysis tools that the user accesses through a web browser. The user can therefore log in from anywhere via the Internet, from virtually any platform and with minimal installation effort, and conduct performance monitoring and analysis of Grid infrastructure and workflows in a user-friendly way. The tools collect data dynamically from services running at distributed Grid sites, analyze the performance and monitoring data, and visualize the performance results in the portal.

Figure 4: Architecture of DIPAS
Figure 4 depicts DIPAS, which comprises three main parts: the DIPAS Portal, the DIPAS Portlets/Portlet Services, and the DIPAS Gateway.

- The DIPAS Portal provides a single place for the user to conduct performance monitoring and analysis of Grid infrastructure and workflows. The content of the portal is provided by the DIPAS Portlets/Portlet Services. The portal also includes a Java applet that implements the main GUI for performance analysis and visualization of workflows. The applet visualizes the monitoring data of the workflows and performs part of the performance analysis itself. It also controls the DIPAS Gateway (see below) to perform the workflow overhead analysis, and displays the resulting overheads to the user. The Java applet is configured to use the Java Plug-in mechanism [JAVA-PLUGIN]. The portal is deployed in a web container based on Tomcat [TOMCAT].
- The DIPAS Portlets/Portlet Services are implemented using Gridsphere [GRIDSPHERE]. The portal interacts with the portlets and portlet services, which process user requests through web interfaces and generate the content displayed in the portal. They are used to provide monitoring data of the Grid infrastructure to the portal.
- The DIPAS Gateway is a GT4 WSRF [WSRF] service that acts as a mediator between the portal and the various services (e.g., GOM, GEMINI, GWES) involved in performance monitoring and analysis. It also implements the overhead analysis and the search for performance problems.

DIPAS provides various performance analysis and visualization features, including:

- Online performance analysis and visualization of workflow executions
- Overhead analysis for workflows
- Search for performance problems

The online performance analysis and visualization of workflows can be conducted using either the Web portal or a standalone Java application.
The performance analysis and visualization of workflows lets the user visually observe the trace of execution phases for arbitrary workflow activities. The execution times of activities or activity instances can be compared, and the distribution of activities among Grid sites can be determined.

The workflow overhead analysis is based on a novel classification of workflow overheads. Overheads are classified into subcategories such as middleware overheads, load imbalance, data transfers, etc. Based on the monitoring data provided by GEMINI, DIPAS analyzes workflows, workflow activities, activity instances, etc., and determines the detailed overheads associated with them. Overhead analysis can be conducted online, and the resulting overheads are visualized in a tree view.

The search for performance problems points out to the user performance problems that occurred at runtime. The user specifies performance conditions, for example that the performance tool should inform the user if an execution time exceeds 10 minutes. Based on these conditions, the tool checks the performance metrics of workflow activities and activity instances. If any performance condition holds, the tool reports the performance problem to the user, including detailed information about where the problem occurred. Performance problems are visualized in the performance analysis GUI.

In the stable version, the DIPAS Portal is implemented using GridSphere and Java applet technology. The DIPAS Gateway is a WSRF-based service, implemented using Globus Toolkit 4.0 [GLOBUS].

3.5. DR: DATA REPRESENTATIONS AND SERVICES INTERFACES

This section describes the performance service interfaces and data representations implemented in WP3 within the K-WfGrid project, henceforth referred to as DR. DR contains XML-based data representations and languages, and OWL-based ontologies, defined for:

- Describing monitoring and performance data, and specifying requests for controlling performance data query and subscription.
- Describing the structure of workflow applications, and specifying requests for controlling the instrumentation of workflows.
- Specifying performance analysis requests.
- Describing information about performance data and services, and representing in ontological form the performance data associated with Grid workflows [WFPERFONTO].
DR provides standardized representations for data and requests. It is used by the services involved in performance analysis and monitoring, e.g., DIPAS and GEMINI, as well as by services that utilize performance and monitoring data, e.g., KAA and Scheduler. DR helps simplify the interaction and integration among the various K-WfGrid Grid services. Table 1 lists the XML-based data representations and languages in the stable version.

Category             Name                         Description
Data Representation  service.available            Availability of services
                     host.cpu.used                CPU usage of a computational node
                     host.mem.used                Memory usage of a computational node
                     host.system.loadavg          Load average of a computational node
                     path.delay.roundtrip         Roundtrip delay of a network path measured at IP level
                     path.bandwidth.capacity.tcp  TCP bandwidth of a network path
                     host.sysinfo                 System information of a computational node
                     app.prof                     Profiling data of applications
                     app.event                    Application events
                     wfa.event                    Workflow events
Requests             SIRWF                        Standardized intermediate representation for workflow-based application
                     PDQS                         Performance data query and subscription
                     WIRL                         Workflow instrumentation request language
                     WARL                         Workflow analysis request language

Table 1: XML-based data representations and requests

Table 2 lists the OWL-based ontologies developed as part of the stable version.

Name          Description
WfPerfOnto    Ontology describing performance data associated with workflows
WfMetricOnto  Ontology describing performance metrics
Mondata       Ontology describing information about available performance and monitoring data

Table 2: OWL-based ontologies for performance and monitoring data
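The performance conditions that drive the search for performance problems (Section 3.4) can be illustrated with a small sketch. It models the example from that section, in which the user is informed when an execution time exceeds 10 minutes. The sketch is an assumption-laden illustration only: the classes `Condition` and `Problem`, the function `search_problems`, the metric key `execution_time_s` and the activity names are hypothetical, and they do not reflect WARL syntax or the DIPAS API.

```python
import operator
from dataclasses import dataclass

# Comparison operators a condition may use (hypothetical representation).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Condition:
    """A performance condition such as: execution_time_s > 600."""
    metric: str
    op: str
    threshold: float

    def holds(self, value):
        return _OPS[self.op](value, self.threshold)


@dataclass(frozen=True)
class Problem:
    """A detected problem, recording where it occurred and which condition held."""
    activity: str
    metric: str
    value: float
    condition: Condition


def search_problems(measurements, conditions):
    """Check every condition against every activity's metrics.

    measurements maps an activity name to a dict of metric name -> value.
    Metrics that an activity does not report are skipped.
    """
    problems = []
    for activity, metrics in measurements.items():
        for cond in conditions:
            value = metrics.get(cond.metric)
            if value is not None and cond.holds(value):
                problems.append(Problem(activity, cond.metric, value, cond))
    return problems


if __name__ == "__main__":
    ten_minutes = Condition("execution_time_s", ">", 600.0)
    measured = {
        "activity-1": {"execution_time_s": 720.0},
        "activity-2": {"execution_time_s": 45.0},
    }
    for problem in search_problems(measured, [ten_minutes]):
        print(problem.activity, problem.metric, problem.value)
```

Each reported problem carries the activity name alongside the violated condition, mirroring the requirement that the user be told where the problem occurred.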
4. REFERENCES

[APPLET] Sun Developer Network, http://java.sun.com/applets/
[DIPASUSER] K-WfGrid Consortium, Performance Analysis Service - User Manual, August 2006.
[DIPASDEV] K-WfGrid Consortium, Performance Analysis Service - Developer Manual, August 2006.
[DRDEV] K-WfGrid Consortium, Performance Service Interfaces and Data Representation - Development Manual, August 2006.
[DRUSER] K-WfGrid Consortium, Performance Service Interfaces and Data Representation - User Manual, August 2006.
[GEMINIUSER] K-WfGrid Consortium, GEMINI - User Manual, August 2006.
[GEMINIDEV] K-WfGrid Consortium, GEMINI - Developer Manual, August 2006.
[GLOBUS] Globus Toolkit, http://www-unix.globus.org/toolkit/
[GRIDSPHERE] The Gridsphere Portal Framework, http://www.gridsphere.org/gridsphere/gridsphere
[JAVA-PLUGIN] Sun Developer Network, http://java.sun.com/j2se/1.4.2/docs/guide/plugin/index.html
[TOMCAT] The Apache Software Foundation, http://tomcat.apache.org/
[WFPERFONTO] Hong-Linh Truong, Thomas Fahringer, Francesco Nerieri, Schahram Dustdar, Performance Metrics and Ontology for Describing Performance Data of Grid Workflows, IEEE International Symposium on Cluster Computing and Grid 2005 (CCGrid2005), 1st International Workshop on Grid Performability, IEEE Computer Society Press, Cardiff, UK, 9-12 May 2005.