INTELLIGENT ALARMS DETECTION FOR THE ANALYSIS OF SYSTEM FAULT IMPACT ON BUSINESS

Pace, C. (1), Russo, I., Fernández, V., Rossi, B. & García Martínez, R.

Abstract

Tools for fault impact analysis are important for the deployment of critical mission systems. They can also be used as an aid during the development phase. In these pages we introduce several concepts related to "business alarms". Business alarms are an approximation to the company's business conceptual scheme, driven by the business rules derived from the conceptual schemes of its systems. To specify them we propose the use of typical Knowledge Engineering techniques.

Company's High Dependence on Computer Systems: A Definition of Critical Mission Systems

Business reality shows how strongly a company's business processes depend on the computer systems that support them, and how much confidence is placed in the reliability of the technology. Not long ago, a computer system was the replacement, or the complement, for a manual workflow, and there was always the option of returning to the old methods if the system failed. Today the situation is completely different: a computer system replaces an older computer system, there is no way back, and we no longer have the choice of returning to a "blood traction" system (that is, a manually powered one).

Among the business processes that depend on computer systems are those we will call "Critical Mission". A pragmatic definition of "Critical Mission" is the following: "A Critical Mission System is one that, if it doesn't work or doesn't perform properly, the main service provided by the company cannot operate" [MASSA, 1997].

It follows from this definition that the availability of the company's services depends on the availability of its critical mission systems.
The successful implementation of any high availability scheme for critical mission applications requires the awareness of the company's management, as well as the ability to revise the generalizations that were established for a given context [BATESON, 1979].

Objectives

The objective of alarm detection for the analysis of the impact of system faults on business is to reduce the gap between business controls and typical system controls. To that end, besides specifying an alarm visualization system for the management responsible for the systems, we specify the transmission of alarms to a center with the following purposes:
- Development support
- Help desk, with an expert database of end user problems and their resolutions
- Coordination of contingency maneuvers

Because projects are rushed, errors are most often not taken into account during the development phase, and a transactional structure for handling them is added later. Sometimes this structure is not completely implemented, and error handling is left to the mechanisms of the database engines and operating systems. The existence of a development support center makes it easier to add code fragments for the purpose of centralized debugging.

The support center will have a rule-based main kernel that links development with the end user help desk. This kernel will make it possible to:
- Give higher quality help to end users.
- Relate user faults to system faults in the format used by the development manager.

Moreover, the very nature of these objectives eases the acceptance of the knowledge engineer in the organization (one of the first steps toward making knowledge acquisition possible). It is the responsibility of the company's management to present him as the developer of this center and of the help desk for development and end users.

Description of the Knowledge Engineer Role

It is sometimes hard to explain the knowledge engineer role to the analysts, because they may see his function as a possible risk to their jobs. The main difference is that the analyst deals with the needs of the whole organization and its systems, whereas the knowledge engineer deals with the experts [BRULE, 1989].

The interactions between the experts and the knowledge engineer are represented in the following diagram:

[Diagram labels: Critical Mission Systems Expert, Networking Expert, Knowledge Engineer, Platform Expert]
A knowledge engineer does not relate directly to the whole organization; he does so through the interface provided by an expert in critical mission systems. The different experts must be familiar with the business-based definition of critical mission systems adopted by management.

The systems management area of a typical company has at least the following departments:
- Development
- Networking
- Platforms

The expert in critical mission systems, together with the knowledge engineer, will handle the management of multiple experts who work with different conceptual schemes over the same domain. For this we use the following reference frame [SHAW, 1989]:
- Consensus: experts use terminology and concepts in the same way.
- Conflict: experts use the same terminology for different concepts.
- Correspondence: experts use different terminology for the same concepts.
- Contrast: experts differ in both terminology and concepts.

An expert in critical mission systems must have the following profile:
- Knowledge of, and commitment to, the activities that make up the company's main business.
- Experience in multi-platform environments.
- Experience in organizational and technological change.
- Management of, access to and control over every support resource.
- Responsibility and management capacity during possible system contingencies.

For this expert black boxes do not exist: every aspect of a system that could affect the company's business processes is his own problem, and for him there is no separation of processes and responsibilities.

In our case the steps of the knowledge engineer are:
a) Isolate the company's main and necessary business processes and the associated computer systems.
b) Formalize and document the present and necessary redundancies.
c) Identify the company's business alarms and their formation rules and dependencies.
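As an illustration of the reference frame of [SHAW, 1989] used above to reconcile multiple experts, the four cases can be read as the outcome of two yes/no questions: do the experts use the same term, and do they mean the same concept? The following is only a minimal sketch of that reading; the names Relation and classify are hypothetical and are not part of the prototype described in this paper.

```python
from enum import Enum


class Relation(Enum):
    """The four cases of the reference frame (names are illustrative)."""
    CONSENSUS = "same terminology, same concept"
    CONFLICT = "same terminology, different concepts"
    CORRESPONDENCE = "different terminology, same concept"
    CONTRAST = "different terminology, different concepts"


def classify(same_terminology: bool, same_concept: bool) -> Relation:
    """Map the two yes/no observations onto one of the four cases."""
    if same_terminology and same_concept:
        return Relation.CONSENSUS
    if same_terminology:
        return Relation.CONFLICT
    if same_concept:
        return Relation.CORRESPONDENCE
    return Relation.CONTRAST


# Hypothetical example: two experts both say "link down" but one means a
# physical circuit failure and the other a lost application session.
case = classify(same_terminology=True, same_concept=False)
```

In this hypothetical example the result is the Conflict case, which tells the knowledge engineer that the shared term must be split into two concepts before any alarm rule is written around it.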
Business Alarms

A business alarm is a parameter that shows whether any of the critical mission systems is failing, directly or indirectly, in a way that interrupts the main business.
Each business alarm is the result of encapsulating computer system alarms by critical mission task [MASSA, 1996].

[Figure: Encapsulation by Critical Mission Tasks. Labels: Business, Business Manager, Network, Network Manager, Platform, Platform Manager]

These alarms can be viewed from a development support center, where they are coordinated by business alarm managers, who may act reactively or proactively according to their responsibilities and profiles in each case.

To demonstrate this concept, a prototype was implemented for the encapsulation and viewing of business alarms, following a possible specification given by an expert in critical mission systems. The program runs under Windows and is responsible for converting common alarms into the different formats of business alarms. The prototype works with the following kinds of alarms:
- SNMP
- Conversion of host alarms into business alarms
- Code-embedded alarms
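The encapsulation of system alarms by critical mission task can be sketched as follows. This is not the prototype's code, which is not reproduced in this paper; it is a minimal illustration under stated assumptions: the classes SystemAlarm and BusinessAlarm, the function encapsulate, the component names and the task-to-component table are all hypothetical, and the formation rule used here (a task raises a business alarm as soon as any component it depends on raises a system alarm) is the simplest rule one could choose.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class SystemAlarm:
    kind: str            # "snmp", "host" or "embedded" (assumed labels)
    component: str       # hypothetical identifier of the failing component
    raised_at: datetime


@dataclass
class BusinessAlarm:
    task: str                    # critical mission task that is affected
    first_raised_at: datetime    # time of the earliest contributing alarm
    causes: list = field(default_factory=list)


def encapsulate(alarms, task_components):
    """Group system alarms under the critical mission tasks they affect."""
    by_task = {}
    for alarm in sorted(alarms, key=lambda a: a.raised_at):
        for task, components in task_components.items():
            if alarm.component in components:
                if task not in by_task:
                    by_task[task] = BusinessAlarm(task, alarm.raised_at)
                by_task[task].causes.append(alarm)
    return list(by_task.values())


# Hypothetical dependency table, as an expert might specify it.
TASK_COMPONENTS = {
    "order-entry": {"router-1", "db-host"},
    "billing": {"db-host"},
}

alarms = [
    SystemAlarm("snmp", "router-1", datetime(2000, 1, 1, 10, 5)),
    SystemAlarm("host", "db-host", datetime(2000, 1, 1, 10, 0)),
    SystemAlarm("embedded", "report-job", datetime(2000, 1, 1, 10, 1)),
]

business_alarms = encapsulate(alarms, TASK_COMPONENTS)
for ba in business_alarms:
    print(ba.task, ba.first_raised_at.isoformat(), len(ba.causes))
```

In this example the alarm on the hypothetical component "report-job" produces no business alarm, because no critical mission task depends on it, while a single host alarm on "db-host" surfaces under both tasks.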
The business alarms are consolidated in a time-indexed database. Each kind of alarm has its own formation rules. They are shown graphically or by an automated procedure. The program also features an e-mail interface and the retransmission of alarms and notifications to field experts. This tool can also be used as an arbitration tool and registry. The program was built following the common steps of alarm software programming. The program modules are:
- Synchronization
- Auditing
- Information Capture
- Consolidation
- Analysis
- Viewing
- Registry

One of the most important modules is the one we will call the software alarm proxy. This proxy interprets and translates the alarms into the standards used by the network or the hardware platforms.

[Figure labels: Proprietary, Proxy, Standard, Other, Encapsulation by Critical Mission Task]

Identification and Utilization of Points of Reinforcement and Redundancies

During his work in the company, the knowledge engineer will identify the points where availability conditions must be reinforced and where redundancies must be implemented. We distinguish two types of redundancy:
- Redundancy of localized systems
- Redundancy of distributed systems

The first consists of implementing systems that are at least duplicated at the original site. It is desirable that the system have an automatic recovery procedure. Distributed systems redundancy consists of duplicating the original system at a remote site. For this, a mechanism that ensures data coherence must be provided.
[Figure labels: Site A, Site B, Second Order Redundancy Relationship, Redundancy and Load Balancing, Coherence Negotiation, Contingency Startup Maneuvers]

The expert must know each redundancy type and know how to distinguish real redundancies from fictitious ones. Redundancies in a computer system can be:
- Basic Service Redundancy: without these services the system cannot operate; examples are the power systems and the temperature and humidity controls.
- Platform Redundancy: can be provided by several different platforms or by hardware with an internally redundant architecture.
- Network Redundancy: for each of the system's interconnections there must be logically and physically different paths or circuits. These redundant paths can share and balance the load in the nominal system state, but they must be dimensioned from the outset so that peak activity can be carried with half of the circuits.
- Software Redundancy: the main purpose of all the redundancies already mentioned is to sustain software redundancy. Software redundancy is the hardest redundancy to achieve. Several issues of control and data coherence must be taken into account, and these depend on the inner architecture of each software module.

[Figure labels: SOFTWARE REDUNDANCY, NETWORK REDUNDANCY, PLATFORM REDUNDANCY, SUPPORT REDUNDANCY]

Conclusion and Results

The role of the knowledge engineer covers a wide spectrum in a critical mission environment, from supporting the original design to the startup and tuning of the end user help desk.
Due to the strong relationship between time and alarms, we verified the importance of having a single time reference for all the company's computer systems, or of providing the mechanism needed to adjust the time records of each separate system. The formalization of rules and knowledge in the area of critical mission systems allows the company to dimension and forecast its systems correctly, taking their critical nature into account. The processes of systematization and formalization of knowledge produced the following results:
- Rules for platform specification, in order to plan a scheme of high availability and contingency.
- We documented the need to create software alarms and a development support center based on these alarms.
- The cost issues related to the startup of an end user help desk were verified.

Finally, a prototype program was built for demonstration purposes, in order to show the concepts presented.

Future Development Issues

Once the database structure has been built and its attributes identified as follows:
- Exogenous: Network, Software, and Platform.
- Endogenous: Business. Claims registered in the help desk system.

we can add to the prototype modules for automated alarm rule extraction using algorithms such as ID3, C4.5, OC1 or Sipina, and later compare the results with the expert's opinion.

References

[BATESON, 1979] G. Bateson. Mind and Nature: A Necessary Unity. New York: Dutton, 1979.
[BRULE, 1989] James F. Brulé and Alexander Blount. Knowledge Acquisition. New York: McGraw-Hill, 1989.
[GAINES, 1987] Brian R. Gaines. Canadian Engineering Centennial Convention: Proceedings of Electrical Engineering Sessions. IEEE 87TH0186-7, pp. 42-49, 1987.
[MASSA, 1996] Eduardo Julio Massa. Critical mission applications for open system architectures. International Systems Audit and Control Association meeting. Chicago, 1996.
[MASSA, 1997] Eduardo Julio Massa. Procesos de Misión Crítica y Procesamiento de Información en Red en Arquitecturas Abiertas. Estrategia de Controles por aplicación de Misión Crítica. Conferencia Internacional de CISA, 1997.
[SHAW, 1989] Mildred L. G. Shaw and Brian R. Gaines. Comparing Conceptual Structures: Consensus, Conflict, Correspondence and Contrast. Knowledge Acquisition 1(4), 341-363, 1989.