Cognitive Model-Based Autonomic Fault Management in SDN

1 Cognitive Model-Based Autonomic Fault Management in SDN Sungsu Kim

2 Doctoral Thesis Cognitive Model-Based Autonomic Fault Management in SDN Sungsu Kim (김성수) Division of Electrical and Computer Engineering (Computer Science and Engineering) Pohang University of Science and Technology 2013

3 인지모델을 이용한 SDN 장애관리에 관한 연구 (A Study on SDN Fault Management Using a Cognitive Model) Cognitive Model-Based Autonomic Fault Management in SDN

4 Cognitive Model-Based Autonomic Fault Management in SDN by Sungsu Kim Division of Electrical and Computer Engineering (Computer Science and Engineering) Pohang University of Science and Technology A dissertation submitted to the faculty of the Pohang University of Science and Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Division of Electrical and Computer Engineering (Computer Science and Engineering). Pohang, Korea June 27, 2013 Approved by Prof. James Won-Ki Hong

5 Cognitive Model-Based Autonomic Fault Management in SDN Sungsu Kim The undersigned have examined this dissertation and hereby certify that it is worthy of acceptance for a doctoral degree from POSTECH. June 27, 2013 Committee Chair James Won-Ki Hong Member Hwangjun Song Member Jae-Hyoung Yoo Member Chansu Yu Member Jangwoo Kim

6 DECE Sungsu Kim, 김성수, Cognitive Model-Based Autonomic Fault Management in SDN. 인지모델을 이용한 SDN 장애관리에 관한 연구, Division of Electrical and Computer Engineering (Computer Science and Engineering), 2013, 84P, Advisor: James Won-Ki Hong. Text in English.
ABSTRACT
In the past decades, network technologies have rapidly evolved with respect to their conception, coverage, capacities and complexities. Advanced technologies, such as software agents and policy-based network management, have been developed for more efficient network management. However, the development of network management technologies has not kept pace with the rapid evolution of the network technologies. Autonomic network management, which helps the network itself detect, diagnose and repair failures by adaptive configuration changes based on the context, is a solution approach for advanced network management. Autonomic network management is first and foremost a method to manage complexity. Autonomic network management systems can lead to important business advantages, such as reduction of operational expenditure through task automation. Unfortunately, autonomic network management is still an immature technology that poses many challenges to be solved before successful deployment. To be a truly autonomic management system, not just an automated one, a system must be able to adapt itself to changing situations. Autonomic network management systems have relied on control loops to change their behaviors based on observations of situations. Existing control loops provide only a single routine to handle situations, regardless of the nature of problems. Although most

7 urgent problems require speedy solutions, some problems need the best solution regardless of response time. In particular, existing control loops cannot deal with unexpected situations that are not defined by policies. In this thesis, we propose a novel autonomic network management architecture, called CogMan, which is based on a cognitive model for efficient problem resolution and accurate decision making even in unexpected situations. The cognitive control loop of CogMan provides reactive, deliberative, and reflective loops for managing systems based on an analysis of the current status. In order to validate the proposed architecture and the control loop, we apply them to fault management in Software Defined Networks (SDN), as well as legacy networks. We also propose a Fast Flow Setup (FFS) algorithm for fast failure recovery in SDN. Finally, this thesis presents experimental results that evaluate how the proposed architecture and algorithm manage various failure situations.

8 Contents
I INTRODUCTION
  Background
  Problem Statements
  Research Approach
  Thesis Organization
II RELATED WORK
  Network Management Approaches
  Management Information Models for Knowledge Representation
  Control Loops for Autonomic Systems
  Fault Management
    Failure Recovery in SDN
    Alarm Correlation Techniques
  Related Projects
    Knowledge Plane and 4D
    AutoI
    4WARD
    FAME
  Chapter Summary

9 III Cognitive Network Management Architecture
  Overall Architecture of CogMan
  Knowledge Representation
  The Cognitive Control Loop
  Hierarchical Autonomic Management Architecture
  Chapter Summary
IV Fault Management in SDN
  Introduction
  Detailed Fault Management Processes
  Prototype Implementation
  Evaluation
  Chapter Summary
V Autonomic Fault Diagnosis based on Association Rule Mining
  Introduction
  Multiple Control Flows based on the Priorities of Alarms
  Association Rule Mining
  Determination of Alarm Priorities
  Evaluation
  Chapter Summary
VI CONCLUSIONS AND FUTURE WORK
  Summary
  Contributions
  Future Work
  Chapter Summary
REFERENCES

10 List of Figures
II.1 Manager-agent model
II.2 Policy-based network management framework
II.3 P2P based network management architecture
II.4 IBM autonomic control loop
II.5 Boyd's Observe-Orient-Decide-Act loop
II.6 FOCALE autonomic architecture
II.7 The CHIP architecture
II.8 OpenFlow group table
II.9 Path protection
III.1 Conceptual architecture of CogMan
III.2 Conceptual model of the cognitive control loop
III.3 The cognitive control loops
III.4 Block diagram of autonomic element
III.5 Example of autonomic element hierarchy
IV.1 Reactive, deliberative, and reflective cases for managing failures
IV.2 Example of a causality graph
IV.3 Example of restoration
IV.4 Fields used for FFS
IV.5 Example of FFS
IV.6 A CogMan prototype: fault management
IV.7 Recovery time of UDP flow when a single failure occurs

11 IV.8 Recovery time (multiple failures)
IV.9 Control packet exchange ratio and traffic generated during failure recovery (multiple failures)
IV.10 Number of required packet exchanges for flow setup
V.1 Multiple control flows based on the control loop
V.2 Example of alarm generalization
V.3 Example of a dependency model
V.4 Ontological model for alarms and priority
V.5 Experimental Topology
V.6 Synthetically Generated Alarms from Each Device
V.7 Clustered Set of Alarms

12 List of Tables
V.1 Alarm transaction data set
V.2 Alarms and detected association rules

13 List of Algorithms
1 Normalize (observed)
Compare (clusterset)
FastFlowSetup(packet)

15 Chapter I INTRODUCTION
This chapter provides a brief introduction to autonomic network management. We list the main problems in current autonomic network management and outline our proposed approaches.
1.1 Background
Network technologies have rapidly evolved with respect to their conception, coverage, capacities and complexities. Traditional data networks have been replaced by more elaborate ones, as new network technologies and applications with more complex requirements have emerged, such as Voice over IP (VoIP), real-time video streaming, and online game services. However, network management technologies have not caught up with the rapid evolution of the network technologies. In the early days, network management systems depended on Command Line Interfaces (CLI), which enabled network administrators to query and manually reconfigure

16 network elements. For efficient network management, advanced technologies, such as software agents [1] and policy-based network management [2], have been developed. Although these newer technologies provide partial automation, decision-making in network management still lacks situational understanding and adaptability because of their inherently static nature. In addition, traditional management approaches cannot handle the ever increasing complexity of managing a large number of devices that use sophisticated and heterogeneous technologies. Autonomic network management, which helps the network itself to detect, diagnose and repair failures by adaptive configuration changes based on the context, is a solution approach for the aforementioned problems [3]. Autonomic network management is first and foremost a way to manage complexity. If autonomic network management systems can perform manual, time-consuming tasks, they will save operators and administrators valuable time, enabling them to perform higher level cognitive network functions, such as network planning and optimization. They will also lead to important business advantages, such as reduction of operational expenditure through task automation. Autonomic network management is inspired by the human nervous system, which manages different body processes without conscious effort [4]. Autonomic network management approaches have become more decentralized and less deterministic, as they apply the mechanisms of biological systems [5]. Unfortunately, autonomic network management is still immature, and many challenges must be resolved before successful deployment. Challenges for autonomic network management include the representation of semantics and knowledge, bio-inspired behavior models, and control loops

17 for self-management (i.e., self-configuration, self-healing, self-protection, and self-optimization) [6]. To be a truly autonomic network management system, not just an automated network management system, the system should be able to adapt to changing situations. Autonomic network management approaches broadly rely on control loops for a network to change its behaviors by itself based on observations of its current state. Existing control loops only provide limited functionalities for unexpected situations, as they resolve problems based on predefined policies. For successful deployment of autonomic network management systems, there has been a call for more sophisticated and advanced methodologies for improving situation-awareness and adaptability to changing environments.
1.2 Problem Statements
Autonomic network management systems broadly rely on feedback or control loops to adapt their behaviors based on observations of the current state. IBM proposed the Monitor-Analyze-Plan-Execute (MAPE) control loop [7] and Strassner proposed the FOCALE control loop [8]. These control loops provide functionalities such as self-configuration, self-healing, self-optimization, and self-protection. Existing control loops define building blocks and control logic to maintain a managed system in a desired state. Existing control loops provide a single routine for control and heavily depend on policies specified by administrators to handle situations. Situations that have no matching policy, therefore, cannot be handled by the control loop logic, resulting in situations where the management system is not able to resolve the

18 unanticipated problems. To achieve a satisfactory level of adaptability, control loops should be able to handle situations for which the system has no predefined plans. Autonomic network management is often incorrectly characterized as one or more of the following self-functions: self-configuration, self-healing, self-optimization, and self-protection. These are the benefits of autonomics, not its defining features. In order to implement any of these four functions, the system must first know itself and its environment: what its capabilities are and what is demanded of it. This is called self-knowledge, and is the basis for Situated Autonomic Communications [9]. Specifically, the ability to learn and incorporate knowledge at runtime is crucial to harnessing the power of emergent behavior for enabling components and systems to dynamically self-organize [10]. There have been many autonomic network management architectures and research projects, but only a few have validated their methods by resolving real network management problems better than existing methods. Therefore, in this thesis, we validate our proposed autonomic network management concept by applying it to fault management in Software-Defined Networks (SDN) [11], as well as traditional networks. Although we also apply the proposed autonomic management architecture to manage faults in traditional networks, we focus on the SDN case in this thesis. Since SDN separates the control and data planes, it gives administrators more control, which makes an efficient decision-making process more important. The Open Networking Foundation (ONF) [12] defines OpenFlow [13] as the first standard communications interface defined between the control and forwarding layers of an SDN architecture. Failure recovery in SDN differs in several aspects from that of traditional networks.

19 Restoration and protection are well-known failure recovery mechanisms in OpenFlow networks [14]. In restoration, the controller redirects affected flows one by one, so recovery takes a long time if many flows are affected. Protection sets up a backup path in advance, like Multiprotocol Label Switching (MPLS) path protection. This mechanism provides fast failure recovery that meets the 50 ms failure recovery requirement of carrier-grade networks [15]. However, there are cases that protection cannot cover, such as multiple failures that affect both working and backup paths. When there is no available backup path, restoration is the only possible method, but its recovery time is relatively long. As this failure recovery case shows, network problems cannot be resolved by a single solution. Based on an analysis of the current situation, the management system should be able to provide the most appropriate solution. In this thesis, we concentrate on the following key questions.
What are the limitations of the current network management approaches?
What features of autonomic network management should be investigated to handle situations adaptively?
How can we design a scalable management architecture to support a growing number of users, devices, and services?
Are the existing failure recovery mechanisms in SDN sufficient for managing various failure scenarios?
How can we manage multiple failures as well as a single failure in SDN?
How can we reduce the control overhead of an SDN controller?
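The distinction between protection and restoration drawn on this page can be made concrete with a small sketch. It is illustrative only and is not the thesis's implementation: the `FailoverGroup` class, the `recover` function, and the port numbers are hypothetical names introduced here. The sketch assumes that a switch holds an ordered list of output ports (working path first, backup paths after) and that the controller is consulted only when no listed port is usable.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverGroup:
    """Hypothetical stand-in for a protected flow: output ports in priority order."""
    ports: list = field(default_factory=list)  # ports[0] = working path, rest = backups


def recover(group, live_ports):
    """Classify how a flow is forwarded given the set of ports that are still up.

    Returns a (mode, port) pair:
      ("working", p)     the working path is intact
      ("protection", p)  the switch fails over locally to a pre-installed backup
      ("restoration", None)  no pre-installed path survives, so the controller
                             must compute and install a new path
    """
    for index, port in enumerate(group.ports):
        if port in live_ports:
            mode = "working" if index == 0 else "protection"
            return (mode, port)
    return ("restoration", None)


if __name__ == "__main__":
    group = FailoverGroup(ports=[1, 2])
    print(recover(group, {1, 2}))  # working path available
    print(recover(group, {2}))     # single failure: local failover
    print(recover(group, set()))   # both paths failed: controller involvement needed
```

The third case is the one the key questions above are concerned with: once both the working and the backup path are lost, recovery time depends on how quickly the controller can set up new flows.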

20 1.3 Research Approach
In this section, we explain our solutions to the problems mentioned in Section 1.2 and the research methodologies. To make control loops more adaptable, we apply a human cognition model to our proposed control loop. We propose and design a new autonomic network management architecture called Cognitive network Management (CogMan), which is based on the proposed control loop. We also present how our proposed architecture and methods are applied to fault management in SDN. We propose a Fast Flow Setup (FFS) algorithm for fast failure recovery in SDN. Corresponding to the key questions above, the following items are the goals of this thesis.
This thesis will state the main problems with current network management that our approach solves.
This thesis will propose a cognitive control loop which employs a model of human intelligence for adaptive response to changes in the managed network.
This thesis will propose a hierarchical autonomic management architecture to manage a large number of managed resources with a number of managing nodes.
This thesis will investigate existing failure recovery mechanisms in SDN.
This thesis will propose a failure recovery method for recovering from multiple failures in SDN.
This thesis will validate the proposed architecture and method by implementing a prototype system and applying it to SDN fault management.

21 1.4 Thesis Organization
The organization of this thesis is as follows. Chapter II describes autonomic network management and failure recovery in SDN as related work. In Chapter III, we describe the cognitive control loop and the management architecture of CogMan. In Chapter IV, we present detailed processes for managing faults in SDN, including the proposed algorithm, and then show the experimental results that evaluate the proposed methods. In Chapter V, we present fault diagnosis using the proposed architecture, together with simulation results that validate the proposed methods. Finally, Chapter VI concludes the thesis with a summary, contributions, and possible future work.

22 Chapter II RELATED WORK
In this chapter, we discuss network management approaches, from the traditional manager-agent model to peer-to-peer (P2P)-based network management models. We then present management information models for managing heterogeneous devices and compare existing control loops with the proposed cognitive control loop. We then describe fault management in SDN, as well as in traditional networks. Finally, network management programs and projects are presented.
2.1 Network Management Approaches
Traditionally, various network management techniques, such as Simple Network Management Protocol (SNMP) [16] and Telecommunication Management Network (TMN) [17], have been used to manage communication networks. However, SNMP does not provide any liaison between business requirements and technology. Also, the SNMP architecture is not able to reconfigure managed elements automatically. TMN

23 is based on an object-oriented approach, and its protocol stack is comprehensive, but this makes the stack heavy and adds complexity. TMN agents also lack the intelligence to make important decisions. Web-based technologies such as WBEM were subsequently developed [18]. Both traditional and current distributed technologies face interoperability issues due to different standards and the lack of common models. WBEM is not only a solution to persisting interoperability issues, but it also enhances management capabilities by abstraction and decomposition of business processes and services. However, even if WBEM helps to solve interoperability problems, it does not have intelligent decision-making capability.
Figure II.1: Manager-agent model
Figure II.1 describes the manager-agent model, which defines the principles of operation for a management solution. Managed resources are modeled as Managed Objects (MOs). SNMP is a typical example of the manager-agent model. While SNMP is popular and has been used across most of the industry, it is difficult for it to meet emerging requirements of network management, including interoperability among heterogeneous

24 networks, QoS-guaranteed services, and management of an increasing number of network devices and users.
Figure II.2: Policy-based network management framework
The Internet Engineering Task Force (IETF) [19] provided the policy-based network management framework [20] shown in Figure II.2. The policy-based network management framework is composed of three parts: a dedicated policy repository, Policy Decision Points (PDPs), and Policy Enforcement Points (PEPs). The policy repository stores policies, and PDPs distribute the policies to PEPs. The policy-based management framework is distributed by design, which becomes important as the size of the managed network increases. The management functions are distributed to PDPs to avoid heavy management traffic. P2P networks are overlay networks composed of nodes located at the edges of the Internet. Some researchers have employed the P2P model for network management. P2P-based network management improves traditional network management,

25 for example, through load balancing of management tasks across management nodes, cooperative management, and increased reliability.
Figure II.3: P2P based network management architecture
Figure II.3 presents a P2P-based network management architecture proposed in [21]. A Top-level Manager (TLM) is a peer that is in charge of both reacting to the requests from human operators and communicating with other management peers for the accomplishment of a particular management task. A Mid-level Manager (MLM) is a peer that is responsible for reacting to the requests from the TLM and other MLMs. As shown in Figure II.3, many management nodes which have management functionalities are distributed. Therefore, management tasks are distributed to MLM or TLM nodes. Furthermore, the number of managed devices per MLM can be controlled to support scalability. However, P2P-based network management also incurs additional costs. In terms of network usage, the P2P-based approach consumes more bandwidth, with

26 more information exchange compared to traditional approaches. Also, there is additional overhead for managing management nodes, such as TLMs and MLMs. Our approach focuses on minimizing information exchange between management nodes and supporting scalability.
2.2 Management Information Models for Knowledge Representation
An autonomic principle has been applied to network management in order to overcome drawbacks of the traditional network management techniques. An autonomic principle is a way to reduce complexity [22]. If an autonomic system can perform manual, time-consuming tasks (such as device configuration changes in response to simple problems), then these actions will save operators and administrators valuable time, enabling them to perform higher level cognitive network functions, such as network planning and optimization. Representation of knowledge is a fundamental part of an autonomic principle. Traditional network management architectures, such as SNMP, only describe low-level monitoring data. Since vendors organize their tables and write their Management Information Bases (MIBs) very differently, it is difficult to find the data that is being searched for. An information model is necessary to solve the interoperability problem of disparate data produced by different vendors. We use information and ontological modeling to capture knowledge relating to network capabilities, business goals and policies. This enables reasoning and learning for enhancing and evolving knowledge. There are information models such as CIM [23], SID [24], and DEN-ng [25], which are conceptual information models for describing

27 computing and business entities in Internet, enterprise and service provider environments. We use DEN-ng to represent knowledge in CogMan for several reasons. DEN-ng has a well-developed context model, whereas the CIM and the SID do not. Context is critical for selecting the policies that apply when adapting to changes in environmental conditions. Also, the CIM and the SID do not provide any mechanisms to orchestrate behavior, such as a finite state machine, which DEN-ng provides.
2.3 Control Loops for Autonomic Systems
IBM was a pioneer of autonomic management of IT resources and defined the MAPE control loop [7]. As shown in Figure II.4, sensors and effectors get data from and provide commands to both the entity being managed and to other Autonomic Managers. The Autonomic Manager is an implementation that automates a management function. The knowledge source implements a repository that provides access to knowledge according to the interfaces of the Autonomic Manager. Boyd's Observe-Orient-Decide-Act (OODA) loop [26] was originally conceived for military strategy, but has had multiple commercial applications as well. It describes the process of decision-making as a recurring cycle of four phases: observe, orient, decide, and act, as shown in Figure II.5. Unlike the MAPE loop, this is not a sequential loop. First, stopping observation while the analysis is continuing is not effective in terms of performance. Second, a balance must be maintained between delaying decisions and performing more accurate analysis that eliminates the need to revisit previously made decisions. FOCALE is self-governing, in that the system senses changes in itself and its environment,

28 and determines the effect of the changes on the currently active set of business policies.
Figure II.4: IBM autonomic control loop
As shown in Figure II.6, the FOCALE control loops operate as follows. Sensor data is retrieved from the managed resource (e.g., a router) and fed to a model-based translation process, which translates vendor- and device-specific data into a normalized form in XML using the DEN-ng information model and ontologies as reference data. This is then analyzed to determine the current state of the managed entity. The current state is compared to the desired state from the appropriate Finite State Machines (FSMs). If no problems are detected, the system continues using the maintenance loop; otherwise, the reconfiguration loop is used so that the services and resources provided can adapt to these new needs. There are some problems with the above control loop architectures. First, the control loops are gated by the monitoring function. Hence, if too much data floods the system, the performance of the rest of the system suffers, even if the monitored data is not

29 relevant.
Figure II.5: Boyd's Observe-Orient-Decide-Act loop
Second, all cases are processed using the same routine, so the priority or characteristics of a network problem are not considered. Third, these control loops cannot reason about what actions to take if they encounter a situation that has not been anticipated. Minsky [27] developed a model of human intelligence which is built using simple processes, called agents, which interact according to three layers, called reactive, deliberative, and reflective. A cognitive architecture [28], which is for building a system that attempts to encompass the full range and magic of human cognition, can be a good reference for making a network system more intelligent and effective. This is shown in Figure II.7. Reactive processes respond immediately upon the reception of an appropriate external stimulus. Deliberative processes receive data from and can send commands to the reactive processes. Reflective processes supervise the interaction between the deliberative and reactive processes. By using a human

30 cognition model, the system is able to handle situations that have never been anticipated and to solve network problems adaptively.
Figure II.6: FOCALE autonomic architecture
2.4 Fault Management
In this section, we first describe existing failure recovery mechanisms in SDN. We then present existing alarm correlation techniques for fault diagnosis. Alarm correlation techniques can be used for SDN fault management, as well as for traditional networks.
2.4.1 Failure Recovery in SDN
Fault management of OpenFlow networks differs in a number of aspects from that of traditional networks. In traditional networks, distributed networking devices, such as routers, reconstruct routing paths with changed topology information and update

31 routing tables when a link fails.
Figure II.7: The CHIP architecture
In OpenFlow networks, however, routing decisions are made by a centralized controller. The controller detects topology changes, decides on a route to send packets, and sets up flow tables of switches along the route. There are two well-known failure recovery mechanisms in OpenFlow networks: restoration and protection [14]. Restoration matches well with the OpenFlow principle that a network is controlled by a centralized controller. If a failure occurs, the controller calculates alternative paths for every affected flow and sets up new flow entries in the switches along the new path. Therefore, the time taken to recover from a failure with the restoration mechanism is proportional to the number of affected flows and the length of the new path. Considering that the failure recovery requirement

32 in a carrier-grade network is 50 ms, the failure recovery time of the restoration mechanism is not acceptable for the provision of services on the network [14].
Figure II.8: OpenFlow group table (flow entries in flow tables 0 to n refer to group entries in the group table; each group entry consists of a group ID, a group type, and action buckets; the switch connects to the controller over a secure channel)
To solve this problem, OpenFlow 1.1 [29] introduces a fast failover group entry, which provides fast rerouting. Protection can be implemented using a group table. The group table concept is described in Figure II.8. The group table consists of group entries that contain action buckets. When a packet arrives at a switch, a matching flow entry is examined first. If there is a corresponding group entry, the packet is redirected to that group entry. A group entry of the fast failover type is used for path protection and holds a set of action buckets. The first action bucket is used when the switch is operating under normal conditions, and the other action buckets can be used when the switch

33 output port of the first action bucket is unavailable.
Figure II.9: Path protection (original path: <ABE>; new path: <ADE>)
Flow table of switch A: src a, dst b, Outport 1, Failover port 2
Flow table of switch D: src a, dst b, Outport 1, Failover port none
Figure II.9 shows path protection using a group table. When a new packet with source and destination nodes a and b arrives at switch A, the controller sets up the working path ABE by installing normal flow entries in switches A, B, and E. In addition to the original working path, the protection path ADE is installed in switches D and E. For switch A, a failover-type group entry is added to redirect flows when port 1 is unavailable. As a result, switches B and D each have a normal flow entry, switch A has a normal flow entry and a group entry for protection, and switch E has two normal flow entries. Protection provides faster failure recovery than path restoration, since switches can handle failures by themselves without the controller. However, if multiple failures occur and both working and backup paths are affected by the failures, the affected flows cannot be redirected to the backup paths. For example, in Figure II.9, packets cannot be delivered from host a to host b if link(a,

34 B) and link(a, D) fail simultaneously. We propose the FFS algorithm, which enables recovery from failures in situations where no backup path has been prepared. FFS provides fast flow redirection by reducing packet exchanges between a controller and switches. Our proposed control loop uses different loops based on the situation. In a single failure scenario, we use the reactive or deliberative loops with protection, while multiple failure scenarios are handled by the reflective loop, which employs the proposed FFS algorithm.
2.4.2 Alarm Correlation Techniques
There are four alarm correlation approaches: rule-based alarm correlation [30], codebook-based alarm correlation [31], case-based alarm correlation, and mining-based alarm correlation [32, 33]. However, rule-based, codebook-based, and case-based approaches are highly dependent on the expert knowledge of skilled operators. In particular, it is not easy for them to reflect dynamically changing network conditions, such as those in wireless or overlay environments, because rules or dependency models are made manually based on the assumption that the network is mostly stable. Mining-based alarm correlation is able to detect cause-and-effect relationships between alarms. However, it is hard to detect relationships in a short period of time because of its long processing time. Our method uses both rule-based and mining-based approaches. Efficiency is gained from the rule-based approach, and dynamically changing relationships are detected by the mining-based approach.
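As a rough illustration of what mining-based alarm correlation computes, the following sketch derives pairwise association rules from a set of alarm transactions using the standard support and confidence measures. It is a minimal example under stated assumptions, not the method of this thesis or of [32, 33]: the function name `mine_pair_rules`, the thresholds, and the alarm names are hypothetical, and only rules with a single alarm on each side are considered.

```python
from itertools import permutations


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return rules (x, y, support, confidence) meaning "alarm x implies alarm y".

    transactions: list of sets, each holding the alarms seen in one time window.
    support(x -> y)    = fraction of windows containing both x and y
    confidence(x -> y) = fraction of windows containing x that also contain y
    """
    total = len(transactions)
    if total == 0:
        return []
    alarms = sorted({alarm for window in transactions for alarm in window})
    rules = []
    for x, y in permutations(alarms, 2):
        count_x = sum(1 for window in transactions if x in window)
        count_xy = sum(1 for window in transactions if x in window and y in window)
        support = count_xy / total
        confidence = count_xy / count_x if count_x else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


if __name__ == "__main__":
    # Hypothetical alarm windows; the alarm names are made up for illustration.
    windows = [
        {"link_down", "bgp_down"},
        {"link_down", "bgp_down"},
        {"link_down"},
        {"cpu_high"},
    ]
    for rule in mine_pair_rules(windows, min_support=0.5, min_confidence=0.9):
        print(rule)
```

Counting every alarm pair over every window is what makes this kind of mining slow on large alarm logs, which is consistent with the processing-time limitation noted above and with combining it with a faster rule-based step.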

35 2.5 Related Projects
In this section, we briefly describe related projects and programs for the management of networks.
2.5.1 Knowledge Plane and 4D
The motivation for the Knowledge Plane [34] was the call for a new architecture that would maintain the strengths of the current Internet architecture while addressing its weaknesses. This is evidenced by the vast number of extensions being made to the Internet, most of which are not architectural in nature, but rather point solutions to specific problems that use the Internet (e.g., NATs, VPNs, and firewalls) and/or apply only to one managed network within the Internet (e.g., QoS on one autonomous system). However, the Knowledge Plane did not specify how to work with heterogeneous technologies and devices; no advances in building a common model or in mapping from vendor-specific data to common normalized data were defined. In addition, by combining the management and control planes, the Knowledge Plane fails to consider heterogeneous networks (e.g., wired and wireless networks) that use different control planes and mechanisms. The Data, Discovery, Dissemination, and Decision Plane (4D) [35] is a clean-slate design that changes the IP control plane in order to better control network-wide objectives. As the name implies, it consists of four separate planes. More importantly, this addresses one of the key shortcomings of the current Internet design, which is the inability to separate the different layers that perform different functions. In contrast, our approach (1) provides more sophisticated learning and reasoning algorithms,

(2) emphasizes context-awareness, so that the resources and services provided at any given time are determined by the current context as well as applicable business goals and user needs, (3) explicitly defines and uses a mechanism to translate between different vendor-specific management data, and (4) maintains compatibility with the current Internet. Note, however, that some of the ideas of 4D, and their protocol implementation, can be accommodated by our approach.

AutoI

The goal of the Autonomic Internet (AutoI) project [36] is to develop a self-managing virtual resource overlay that can span heterogeneous networks and supports service mobility, security, QoS, and reliability under the concept of five planes: the Orchestration Plane, Service Enablers Plane, Knowledge Plane, Management Plane, and Virtualization Plane. However, it is doubtful that the Orchestration Plane accomplishes its goal as intended. The Orchestration Plane consists of Distributed Orchestration Components (DOCs), each of which controls a single orchestration domain and enables the AMSs of the domains to communicate and cooperate with each other. AutoI describes the interfaces between DOCs and AMSs, but it does not specify how they control and manage network resources to meet the overall goals and policies of heterogeneous networks. For example, algorithms for inferring and resolving policy conflicts should be devised. In our CogMan architecture, we use the Policy Continuum [2] to develop mappings between policy rules that correspond to each constituency. This enables context-aware policies to be used to orchestrate behavior for business goals, system-level interactions, and other forms of interaction.

4WARD

The 4WARD project [37] has developed a clean-slate network management approach called In-network Management (INM), which is aimed at the management of large, dynamic networks, where a low rate of interaction between an outside management entity and the network will be required. The idea of INM is that management tasks are delegated from management stations outside the network to a self-organizing management plane inside the managed system. In other words, INM takes the view that the network element itself is the best entity for understanding how it should be managed; hence, management of the network must be made intrinsic to the operation of the network. This is enabled through decentralization, self-organization, embedding of functionality, and autonomy. Under this paradigm, the managed system executes local or global functions, and reports the results of its actions to an outside management system or triggers exceptions if intervention from outside is needed. However, its information model is not a true information model, but rather a simple data-oriented mapping of objects. Hence, it does not solve the interoperability problem. We do not believe that all devices have the ability to manage themselves; while a router or switch has the physical space to contain such functionality and the logical computing means to support it, this approach is in general not applicable to the vast majority of end nodes that exist, especially those that are mobile-centric and have limited resources.

FAME

The FAME project [38] addresses the federation of communication networks, as well as autonomic network management. Current research into autonomic communications and autonomic network management emphasizes the automation of decision-making to reduce the operational costs of service providers. However, without addressing the highly dynamic and federated nature of modern service provisioning, this research runs the risk of simply replacing today's network management silos with autonomic network silos. The FAME project structured its research into three strands: (1) Federated Communications Service Management, (2) Service Monitoring and Configuration, and (3) Network Infrastructure Coordinated Self-Management. The federation of heterogeneous networks requires harmonizing different management data from different networks, such as the resources being managed, the context used in communication services, and the policies used to express governance.

The approach and the methodology used by FAME are similar to ours. FAME uses information models and ontologies extracted from DEN-ng [25] for the semantic mapping of management data, and FOCALE [8] is used for its architecture. FAME concentrates on federation, whereas we focus more on cognitive control and management of virtualized resources and cloud environments. We use a new FOCALE cognition model for collecting data, managing, and decision making, which enables our system to learn about its own behavior and effectively respond to context changes.

Chapter Summary

In this chapter, we briefly introduced network management approaches. We also presented information models for knowledge representation in autonomic network management. We then presented control loops and compared them with the proposed cognitive control loop. As we validated our proposed architecture by applying it to fault management, we briefly described fault management techniques in SDN and traditional networks. Finally, we reviewed related research projects and programs for network management.

Chapter III Cognitive Network Management Architecture

In this chapter, we present an autonomic network management architecture based on a cognitive control loop. First, the overall architecture of CogMan is described. Second, we propose a cognitive control loop for elaborate processing of network problems. Finally, we describe a hierarchical management architecture that provides scalability.

3.1 Overall Architecture of CogMan

Figure III.1 depicts the conceptual architecture of CogMan, which incorporates a cognitive control loop enabled by a model and ontology [39]. The AM gathers device-specific data from managed resources. The collected data are fed to the Model-based translation process, which translates vendor-specific data into vendor-neutral data

using the DEN-ng information model. Through this process, CogMan is able to manage network devices regardless of their data models or languages.

[Figure III.1: Conceptual architecture of CogMan]

The AM infers the impact of the information. For example, a single link failure means that users cannot get the level of QoS specified in their Service Level Agreement (SLA). It is therefore classified as a well-known problem that should be processed immediately. The priority and characteristics of the problem affect the decision-making process within the cognitive control loop.

When the cognitive control loop deals with a problem, the reactive loop is used if the problem can be processed quickly. For example, a path protection mechanism in MPLS enables switches to change their paths without the intervention of a human operator. The deliberative loop is used when the problem is anticipated but detailed actions still need to be organized. The reflective loop is used if the problem is not anticipated by the administrators or is difficult to solve by a standard method. It examines how the goals are affected by the problem. Decision-making in the control loops is influenced

by policies, which are sent by a policy manager. Human operators give business goals, which are translated into policies in the policy manager via the user interface. Human operators can also create policies directly. A set of appropriate actions from the cognitive control loop, which is composed of vendor-neutral commands, is translated into vendor-specific commands.

The AM can manage a single network device or a number of network devices. The deployment policy determines how many network devices can be managed by a single AM. This means that a number of AMs may coexist in the managed network. To support scalability, an architecture that describes how to distribute AMs in the managed network is needed. We use a hierarchical management architecture for cooperation between AMs.

CogMan has three important parts: (1) knowledge representation and organization, (2) the cognitive control loop, and (3) a hierarchical architecture for AMs.

3.2 Knowledge Representation

Knowledge is critical to our approach for two different reasons. First, without a single definition of knowledge, it is impossible to solve the interoperability problem

that prevents disparate data produced by different vendors from being shared and reused. This problem is exacerbated by the multiple conflicting standards that are present in the telecommunications world as well as by artifacts such as private MIBs. Second, network management has been limited in its power by having to deal with data, instead of having access to machine-readable semantics. For example, dropped data packets can be easily recovered, whereas dropped control packets could make a problem much worse. Hence, our approach relies on formal models to represent knowledge.

An information model is a representation of the characteristics and behavior of a component or system independent of vendor, platform, language, and repository. In contrast, a data model is an implementation of the characteristics and behavior of a component or system using a particular vendor, platform, language, and/or repository. We use a DEN-ng information model [25] and a DENON-ng ontology model in order to translate between the different data models (e.g., a directory and a relational database) used by different applications. Knowledge is defined once in the DEN-ng information model and then bound to one or more specific forms via one or more specific data models, with ontological reasoning used to ascertain and discover inconsistencies and incompatibilities between data models.

While models are important to represent knowledge, they are not sufficient for the implementation of the proposed architecture. They do not contain the constructs necessary to support the definition of knowledge or reasoning about knowledge, which requires formal semantic definitions. We augment the knowledge contained in models with a set of ontologies. Ontologies provide mechanisms to establish semantic relationships between information. In addition, we gain significant benefits by combining modeled and ontological data. Doing so enables us to reason from facts;

more importantly, it forms the foundation for mapping between different schemata. It can also be used to strengthen the semantics being inferred. For example, if a hypothesis is formed, we can use additional semantic relationships to gather additional information to help prove or disprove the hypothesis. This in effect provides a self-check of the consistency and integrity of the information that has been collected. This is especially important when multiple data from different sources need to be integrated, since they can have different qualities (e.g., accuracy, certainty, and freshness) as well as different formats. Hence, semantic relatedness enables us to solve two important problems: (1) the harmonization of different qualities of data, and (2) the alignment of different ontologies to facilitate extraction of diverse data. This in turn increases the accuracy and reliability of the proofs that use the ontological data. Our current implementation uses a programmable threshold for the former, which enables us to weight the contribution of different data sources; for the latter, we use standard and custom semantic equivalence relationships.

3.3 The Cognitive Control Loop

CogMan is built in accordance with the FOCALE [8] control loop and is self-governing; that is, the system senses changes in itself and its environment, and determines the effect of the changes on the currently active set of business policies [39, 40]. To strengthen self-awareness, the cognitive control loop employs a model of human intelligence that is built using simple processes, which interact according to three layers: reactive, deliberative, and reflective [27][28]. The cognitive control loop employs cognitive processes, as shown in Figure III.2.
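The choice among the three layers, as characterized in Section 3.1, can be sketched as a small dispatcher. This is only an illustrative sketch: the `Problem` fields `anticipated` and `has_ready_action` are hypothetical simplifications of the priority and characteristics that the AM infers, not names taken from CogMan.

```python
from dataclasses import dataclass
from enum import Enum


class Loop(Enum):
    REACTIVE = "reactive"
    DELIBERATIVE = "deliberative"
    REFLECTIVE = "reflective"


@dataclass
class Problem:
    anticipated: bool       # assumed flag: the problem is known to administrators
    has_ready_action: bool  # assumed flag: a predefined response can run at once


def select_loop(problem: Problem) -> Loop:
    """Pick a control loop following the three cases described in Section 3.1."""
    if not problem.anticipated:
        # Not anticipated, or not solvable by a standard method.
        return Loop.REFLECTIVE
    if problem.has_ready_action:
        # Can be processed quickly, e.g. by path protection.
        return Loop.REACTIVE
    # Anticipated, but detailed actions still need to be organized.
    return Loop.DELIBERATIVE


print(select_loop(Problem(anticipated=True, has_ready_action=True)).value)
```

Under these assumptions, a single link failure covered by path protection maps to the reactive loop, while an unforeseen failure falls through to the reflective loop.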

[Figure III.2: Conceptual model of the cognitive control loop. Layers and processes shown: Reflective (Learn, Reason); Deliberative (Plan, Decide, Normalize); Reactive (Observe, Compare, Act); Managed Element(s).]

The system is able to recognize when an event or a set of events has been encountered before. On the basis of the history and features of events, the reactive process makes a decision and executes it without any planning. This reactive mechanism enables much of the computationally intensive portion of the control loop to be bypassed, producing shortcuts that result in a decision being made immediately without time-consuming involvement of higher-level processes. The deliberative loop is the same as that of the original FOCALE control loop process, which takes the Observe-Normalize-Compare-Plan-Decide-Act path. It uses long-term memory to store the methods that were used to satisfy goals on a context-specific basis. The reflective loop examines the various conclusions made by the set of deliberative loops being used, and tries to predict the best set of actions that will maximize the goals being addressed by the system. This process uses semantic analysis to understand why a particular context was entered and why a context change occurred, to help predict how to change contexts more easily and efficiently in the future. These results are also stored in

long-term memory, so that the system better understands contextual changes to its reasoning in order to aid debugging.

[Figure III.3: The cognitive control loops. (a) Inner control loop; (b) Outer control loop. Both panels show the Observe, Normalize, Compare, Decide, Act, Learn, and Reason processes with reactive, deliberative, and reflective paths, together with Context and Policy.]

The cognitive control loop makes a fundamental change to the FOCALE control loops, as shown in Figure III.3. Instead of using maintenance and reconfiguration control loops, the cognitive control loop uses a set of outer control loops that affect the set of inner control loops. The outer control loops are used for large-scale adjustment of functionality by reacting to context changes. The inner control loops are used for more granular adjustment of functionality within a particular context. In addition, both the outer and inner control loops use reactive, deliberative, and reflective reasoning, as appropriate.

Two important changes have been made to the inner loops. First, the Decide function has been explicitly unbundled from the Act function. This enables additional machine-based learning and reasoning processes to participate in the determination of which actions should be taken. Second, the new cognition model has changed the old FOCALE control loops, which were both reflective in nature, to the three control loops shown in Figure III.2, which are reactive, deliberative, and

reflective. Figure III.3 shows the new outer loops of the cognitive control loop. The reactive loop is used to react to context changes that can be handled by network elements themselves; the deliberative loop is used when a context change is known to have occurred, but its details are not sufficiently well understood to take an action; and the reflective loop is used to better understand how context changes are affecting the goals of the AM.

3.4 Hierarchical Autonomic Management Architecture

As the number of network devices increases, centralized management is not viable for either the current Internet or the Future Internet. We propose a hierarchical autonomic management architecture to support scalable network management. We specify the functionalities of a management element and how they can be organized hierarchically [41].

An Autonomic Element (AE) is an abstraction that allows the cognitive control loop to provide distributed functions such as communication, learning, reasoning, and management. AEs are grouped together in cooperating communities, or clusters. An AE is either a managed entity itself or a parent AE that manages other child AEs or managed entities. In the future, network devices may have more intelligence for managing themselves, so routers or switches can be AEs. However, existing network devices have no functionality for managing themselves, so an AE is an agent that manages network devices.

Figure III.4 shows a block diagram of an AE. It is composed of an Autonomic Manager (AM), a Context Manager (CM), a Policy Manager (PM), and a Knowledge Base (KB). An AE can manage a child AE or a Managed Resource (MR) as a Management Entity (ME). When an AE manages a

MR, a Model Based Translation Layer (MBTL) translates vendor-specific data and commands to vendor-neutral data and commands (or vice versa) because MRs are devices from various vendors.

[Figure III.4: Block diagram of autonomic element]

Translated vendor-neutral data is sent to a CM. A CM collects data from an ME and stores it in a context repository. A CM detects changes in the network, in user needs, or even in the business; these context changes in turn activate an associated set of policies that define the functionality the autonomic manager should govern. Other AEs access context information via a context interface (CI). A PM stores available policies in a policy repository, and other AEs configure policies via a policy interface. An AM decides appropriate actions for the current context using a state manager, action manager, learner, and reasoner based on the FOCALE cognition control loops. The decided action is applied to the ME. CogMan selects one of the reactive, deliberative, and reflective control loops based on the current state and collected context. A system built in accordance with FOCALE is self-governing,

in that the system senses changes in itself and its environment, and determines the effect of the changes on the currently active set of business policies.

[Figure III.5: Example of autonomic element hierarchy]

CogMan uses a hierarchical autonomic network management architecture. With complex requirements for network management and the increasing number of devices, network management has become more complex. Therefore, it is hard to manage a network with a single network management server. Multiple nodes will be distributed in the network for managing current networks as well as future networks, and more nodes will be needed for management as the size of a network domain grows.

CogMan adopts a hierarchy for two reasons. First, a hierarchy distributes the processing load of management nodes across hierarchical levels. Second, a hierarchy enables efficient exchange of management information. Consider a pure peer-to-peer network management architecture. Management nodes exchange management information between every pair of cooperating nodes, and this can impose considerable overhead in terms of network bandwidth. A hierarchy of managing nodes reduces the overhead of exchanging management information because managing nodes send management information only to their designated node [41].
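The difference in exchange overhead can be made concrete with a short worked example. It rests on simplifying assumptions that are not stated in the thesis: in the peer-to-peer case every pair of the n management nodes cooperates, in the hierarchical case the nodes form a tree in which each non-root node reports only to its designated parent, and one exchange per pair or per parent link is counted per reporting round.

```python
def p2p_exchanges(n: int) -> int:
    """Pairwise exchanges per round when every pair of n nodes cooperates."""
    return n * (n - 1) // 2


def hierarchy_exchanges(n: int) -> int:
    """Exchanges per round when each non-root node reports to one parent."""
    return max(n - 1, 0)


for n in (10, 100):
    print(n, p2p_exchanges(n), hierarchy_exchanges(n))
```

Under these assumptions the peer-to-peer count grows quadratically with the number of management nodes, while the hierarchical count grows linearly.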

Figure III.5 shows the structure of a hierarchical AE and gives an example of how the hierarchy can be mapped to the physical infrastructure. The Autonomic Network Domain Manager (ANDM), Autonomic Network Manager (ANM), and Autonomic Element Manager (AEM) are all AEs. We designed this hierarchy based on the DEN-ng information model, which describes a network management hierarchy. A parent AE governs the management decisions of its child AEs, such as the coverage of monitoring and the performance metrics to be measured. However, child AEs also have their own autonomic decision-making capabilities. The cognition control loop acts as the main decision engine of an AE, and monitoring data is sent by child AEs or managed entities. A number of AEs should be distributed in the network. If there is only one AE in the network, the architecture is the same as a traditional client-server model and suffers from communication problems such as a single point of failure, because all the management information must be delivered to a single AE.

3.5 Chapter Summary

In this chapter, we presented the proposed architecture. We also described the cognitive control loop, which provides three different loops for handling situations. Finally, we described the hierarchical architecture of CogMan for scalable network management.

Chapter IV Fault Management in SDN

In this chapter, we present how CogMan is applied to manage faults in SDN. Since OpenFlow [13] is a representative protocol that implements the SDN architecture, we implemented our proposed methods in OpenFlow networks to validate them. First, we describe detailed processes for managing faults in SDN. We then present our experimental setup and emulation environments to validate the proposed methods. Finally, we show that our methods recover from failures faster and reduce the overhead of the management system.

4.1 Introduction

To validate our control loop, we apply it to managing failures in SDN. Management and control complexity is high in traditional networks due to the tightly coupled control and data planes. In traditional networks, to deal with a link failure, actual failure recovery of running flows, such as

the changing of routes, is done by distributed routing protocols on networking devices. It is difficult for network operators to become involved in real-time control. On the other hand, new networking architectures, such as SDN and OpenFlow [13], separate data and control planes for packet networks, and an OpenFlow controller implements the control and management plane functions of traditional switches or routers. Consequently, interactions between control and management functions are more significant and our control loop plays an important role in both control and management.

Failure recovery in OpenFlow networks differs from that in traditional networks. Restoration and protection are well-known failure recovery mechanisms in OpenFlow networks [14]. Because the controller redirects flows one by one in restoration, recovery takes a long time. Protection sets a backup path in advance, as in Multiprotocol Label Switching (MPLS) path protection. This mechanism provides fast failure recovery that meets the 50 ms failure recovery requirement of carrier-grade networks [15]. However, there are scenarios that protection cannot cover, such as multiple failures that affect both working and backup paths. When there is no available backup path, restoration is the only possible method, but its recovery time is relatively long. We propose a Fast Flow Setup (FFS) algorithm to handle failures that protection cannot cover. By reducing packet exchanges between switches and a controller, FFS provides faster flow redirection than restoration.

To validate the cognitive control loop and the FFS algorithm, we present details of the processes used by the cognitive control loops to manage failures in OpenFlow networks. We then evaluate the performance of our control loop by conducting failure recovery experiments in an OpenFlow testbed. In a multiple failure scenario, the recovery

time of the proposed method proved to be approximately 45% shorter than that of existing methods.

[Figure IV.1: Reactive, deliberative, and reflective cases for managing failures. Reactive case: a failure is successfully recovered by path protection; the affected flows are short-duration flows; controller involvement is unnecessary. Deliberative case: a failure is successfully recovered by path protection; the affected flows are long-duration flows; the controller should change paths to an optimized path based on the characteristics of the flows (e.g., high-bandwidth real-time video traffic). Reflective case: multiple failures occur and the redirected paths are affected by a failure; the protection mechanism is not able to recover from the failures; the controller should find another available path and set up the affected flows on the new path.]

4.2 Detailed Fault Management Processes

As outlined in Section 2.4, protection is efficient for recovery from a single failure, but it cannot recover from failures if backup paths are not available. This means that a single solution cannot handle all failure situations in OpenFlow networks. Different failure situations require different ways of managing failures. As the MAPE and FOCALE control loops provide only a single loop for control, they do not have any inherent functionality to provide different solutions for different situations. The cognitive control loop analyzes failures and handles them using an appropriate technique.

When failures occur in a managed OpenFlow network, the cognitive control loop first analyzes the failures to decide on an appropriate loop for them. As shown in Figure IV.1, the cognitive control loop classifies failure cases as reactive, deliberative,

or reflective cases in order to manage failures using the most appropriate loop and method. In our approach, the basic mechanism for managing failures is protection. Thus, during flow setup, a controller sets working and backup paths for every flow. Protection guarantees recovery from a single failure within a required time interval, as every flow has one backup path. A single failure case is classified as a reactive or a deliberative case. In both cases, the failure can be recovered by protection. However, as discussed in [43], short-duration and long-lived flows should be treated differently. A short-duration flow disappears quickly, so redirection to an optimal path is not necessary. A long-lived flow, such as a high-bandwidth real-time flow, needs an optimal path to meet its requirements. As available bandwidth changes dynamically over time, the deliberative loop finds the optimal path and redirects the affected long-lived flows from a temporary backup path.

If no backup path is available for failures, the reflective loop handles the failure cases. So far, restoration has been the only possible method to use if protection does not work. However, restoration takes a long time [14], which makes it unsuitable for service provision. We propose the FFS algorithm for fast failure recovery. The reflective loop exploits FFS to process multiple failure situations that cannot be handled by the reactive and deliberative loops. In this section, we present in detail the processes used by the cognitive control loop to manage failures in OpenFlow networks.

(1) The Observe process

When the management system, which exploits the cognitive control loop, receives alarm messages from a managed network element, the Observe process

extracts information about the alarm messages. An instance of the alarm is represented as a tuple of attribute values. An alarm is defined formally as follows:

$A := \{a_n, r_t, r_{id}, t_a\}$

where $a_n$ is the alarm name, $r_t$ is the resource type, $r_{id}$ is the ID of the resource, and $t_a$ is the time at which the alarm occurred. The values of the attributes are set in this process and the alarm instance is sent to the Normalize process.

Algorithm 1 Normalize(observed)
  Input: observed  // set of alarm instances
  Output: clusterset  // set of alarm clusters
  clusterset = {}
  R = {}  // alarms related to a
  C = {}  // existing alarm clusters related to a
  for a ∈ observed do
    R = relatedalarms(a)
    C = getclusters(timegap, t_a)
    if c ∈ C contains r ∈ R then
      add a to the cluster c
      add c to clusterset
    else
      create a new cluster nc and add a to nc
      add nc to clusterset
    end if
  end for
  return clusterset

(2) The Normalize process

In the Normalize process, alarm instances are correlated temporally and spatially. As shown in Algorithm 1, alarm instances received from the Observe process are correlated based on correlation rules. R and C are sets associated

with a. Correlation rules for a and existing alarm clusters that have time gaps less than timegap are extracted from the knowledge base. If a certain cluster c contains alarms r that are related to a, a is added to cluster c and c is added to clusterset. If there is no cluster associated with the alarm instance, a new alarm cluster nc is created and the alarm instance a is added to nc. All of the alarms from the Observe process are processed and clusterset is returned.

For example, assume that alarms {1, 2, 3, 4} are collected from the network topology in Figure IV.2. Each alarm is examined and clustered with related alarms. Assume that {2} is processed first and there is no cluster related to {2}. Then, a new cluster c1 is created, to which {2} is added. {1}, {3}, and {4} are added to c1 one by one due to the relationships between them, as shown in the causality graph. We assume that {2, 4} are symptoms and {1, 3} are the root-cause problems according to the causality graph; consequently, the root cause flags of {1, 3} are set to true and the root cause flags of {2, 4} are set to secondary. Any correlation algorithm can be used for alarm clustering. For example, we used codebook-based alarm correlation for clustering [31]. The alarm cluster c1 is then sent to the Compare process.

(3) The Compare process

As shown in Algorithm 2, the Compare process receives alarm clusters. This use case focuses on alarms caused by link down or port down. Since the alarms are clustered in the Normalize process, every cluster already has information that identifies the root cause of the alarm. Based on the alarm information, the current situation is classified as either a reactive, deliberative, or reflective case,

as shown in Figure IV.1. If a single failure occurs or backup paths are available for the affected flows, protection handles this situation. The controller can detect whether protection works by examining the paths of the affected flows. If the affected flows are short-duration flows, we classify the case as a reactive case. However, if the affected flows are long-lived flows, it is considered to be a deliberative case. The deliberative loop finds the best path for the flows and redirects the flows to that path. To find the path, the Plan & Decide process is called. If both working and backup paths are unavailable, the Compare process classifies this case as a reflective case. Since the failures cannot be recovered by switches, the reasoning process implemented in the controller handles this case.

[Figure IV.2: Example of a causality graph. Example topology: working path A-B-C, backup path A-D-E-C (p#: # is a port number). Alarms in the causality graph: VoIP packet loss; Port Down C: port #1; Port Down B: port #2; Ping test failed between A and C; Ping test failed between A and E.]

(4) The Plan & Decide process

A single failure is successfully recovered from by switches, without any intervention from the controller. Despite successful redirection of flows, long-lived flows need additional actions. The Plan & Decide process finds the best path

that meets the requirements of the affected long-lived flow and sets up flow entries in the switches along the path.

Algorithm 2 Compare(clusterset)
  Input: clusterset  // set of alarm clusters
  Output: none
  if numofFailure(clusterset) == 1 or ispathprotectionfailed() == false then
    flowentries = getaffectedflowentries(clusterset)
    for f ∈ flowentries do
      if islonglivedflow(f) == true then
        Plananddecide(f, c)  // deliberative case
      end if
    end for
  else
    Reasoning(clusterset)  // reflective case
  end if

(5) The Act process

In the Act process, the controller receives high-level actions from the Plan & Decide or Reasoning processes, translates them into OpenFlow-specific commands, and sends them to the target switches. For example, flow entry installation, modification, and other OpenFlow commands for configuration actions are sent to target switches.

(6) The FFS process

The FFS process is for failures that cannot be recovered by a predefined plan. Although protection enables recovery from a single failure within a required time interval, recovery is not possible if multiple failures occur and both working and protected paths are affected by them. It is of course possible to successfully recover from multiple failures if the controller sets more than one protected

path. However, the overhead for setting up the flow is high and an error may occur when setting backup paths. Currently, restoration can be used if protection is not available, but the failure recovery time of restoration is very long and is therefore not acceptable for the provision of network services [14]. Therefore, we need a better method to recover quickly from failures that have not been prepared for.

[Figure IV.3: Example of restoration. The controller obtains the path <ACDE> for the affected flow and then sends a flow entry to each switch on the path in turn (switch A, output action: port 3; switch C, output action: port 2; switch D, output action: port 1). Legend: working path, backup path, 3rd path.]

We propose the FFS algorithm to handle failures quickly when there is no available backup path. The main idea underlying FFS is to reduce the amount of communication between the controller and switches by implanting path information in an entry switch when the controller sets up a flow. As shown in Figure IV.3, if working and backup paths have failed, restoration can be used to recover from the failures. In restoration, the controller sets up a new path for every affected flow one by one. The number of packet exchanges needed to redirect a single flow is the same as the length of the new path: N extra packet exchanges are required to set up an N-switch unidirectional path in restoration. Therefore, if the number of affected flows is M, M × N packet exchanges are required. Excessive packet exchanges are not only overhead for the controller, but also add latency. Since each network has a single controller, the controller sends flow entry setup messages in succession. If the time for sending a flow entry is T seconds, a single flow redirection takes at least N × T seconds. The FFS algorithm requires only one packet exchange per flow setup, so it reduces the overhead of the controller and increases the speed of flow redirection.

[Figure IV.4: Fields used for FFS. A flow entry in a flow table consists of match fields, counters, and instructions. The output action is defined as struct ofp_action_output { uint16_t type; uint16_t len; uint32_t port; uint16_t max_len; uint8_t pad[6]; }. pad[0] holds the flag and pad[1]-pad[5] hold the output port numbers for the remaining switches (switch E, switch D, switch C, with switch C in pad[5]). The same layout is used in the IP options (flag, hop[0]-hop[5]); each switch consumes its entry and updates the field.]

To implant path information in a switch, we use a field that is unused in a normal OpenFlow message. We will now briefly describe how a controller puts path information in a switch. As shown in Figure IV.4, a flow entry consists of match fields, counters, and instructions. Match fields are used to match flows by source address, destination address, etc. The counters field indicates the number of forwarded packets that belong to the flow. The instructions field specifies

the actions to be executed when a packet matches the entry. These instructions result in changes to the packet, action set, or pipeline processing. The output action is a type of instruction that makes a switch forward matching packets to a designated output port. We use the pad field of the output action, which exists only to keep the output action 64-bit aligned and is otherwise unused. Pad[0] is a flag that indicates whether this output action contains a path for FFS or not. A value of 1 means that the output action has a path for FFS. Path information is encoded in pad[1]-pad[5]. Due to restrictions on header size, the switch ID is not included in the encoded path; only the output port on each switch is specified. Since we use the pad field of the output action, which consists of five elements in addition to pad[0], the maximum length of the path that can be represented in our algorithm is six hops. We believe that six hops are enough to support a path change in a small domain network. We can also increase the available length if necessary. For example, if we put two output port numbers in one pad element, FFS can support a 12-hop maximum path. However, in this case, the maximum port number we would be able to represent would be reduced to 2^4. As shown in Figure IV.4, the output port numbers are set from pad[5] to pad[1] in reverse order. The value 2 of pad[5] means that the matching flow should be forwarded to port 2 on switch C.

Algorithm 3 explains in detail the process used when each switch receives a new packet. When a packet that has no matching flow entry arrives, the switch examines its IP option field first. If the flag is set to 1, the switch obtains an

output port from an IP option field, instead of asking the controller for output port information. Then, the switch deletes the output port in the IP options and shifts the remaining elements one position to the right. If the packet has a matching flow entry and there is a failed port in the switch, the switch copies the remaining path from the pad field to an IP option field in the packet. Then, the switch sends the packet to the designated output port.

Algorithm 3 FastFlowSetup(packet)
 1: Input: packet
 2: Output: none
 3: if isNewFlow(packet) == true then
 4:   if IP option flag == 1 then
 5:     construct new flow entry f with the output port specified in IP option[5]
 6:     add flow entry f to flow table
 7:     shift IP option[1-5] one element right
 8:   end if
 9: else
10:   if output action flag == 1 && isAnyPortDown() == true then
11:     put the output action path into IP option
12:     set the output action path to NULL
13:     send the packet to the output port in output action
14:   end if
15: end if

Figure IV.5 is an example scenario of failure recovery with the FFS algorithm. Link failures have occurred at link(A, B) and link(A, D) in a short time interval, so both working and protected paths are affected by the failures. The controller detects that protection cannot handle the situation and starts FFS. To redirect affected flows, the controller first gets the new available path ACDE. (We assume that the controller calculates all the available

paths in its initializing time, so path calculation overhead is not considered.) More specifically, an actual path in an OpenFlow controller is represented as a node-port tuple list. Therefore, ACDE is represented as (A : 4, A : 3), (C : 1, C : 2), (D : 2, D : 1), (E : 1, E : 2) in the controller logic. For example, (E : 1, E : 2) means that the input and output ports of the matching flow in switch E are 1 and 2, respectively.

[Figure IV.5: Example of FFS. The controller obtains the path <ACDE> for the affected flow, puts the path <CDE> into the output action, and sends a flow entry to switch A (output action: port 3). Switch A puts the path into an IP option field of the packet (src: a, dst: b). Switches C, D, and E each set a flow entry as specified in the IP option field and send the packet to D, E, and host b, respectively. Legend: working path, backup path, 3rd path.]

After checking the new path ACDE, the controller adds a flow entry to the first hop, switch A. The flow entry instructs switch A to forward matching packets to port 3. When the controller sets the flow entry to switch A, it implants the remaining path CDE in the flow entry in order to reduce the time for setting the remaining path. This is the core idea of our algorithm: the controller implants the path information into the first hop switch, and it does not add flow entries to all the switches along the path. The remaining path is consumed by the switches in the path. If the received packet has a matching flow entry, the switch checks the output action

field to forward the packet and examines pad[0], a flag value. Pad[0] is set to 1, so the path set in the pad field is copied to the IP option field of the matching packet so that the path can be set along the designated switches. Switch A modifies pad[0] to 0 and sends the packet to port 3. (Switch A changes pad[0] to 0 to prevent a redundant flow setup process.) Switch C then receives the packet from switch A. Switch C has no matching flow entry for the packet. In the original OpenFlow protocol, a switch asks a controller to decide an action for the packet. This step requires extra packet exchanges, because every switch in the path would need to exchange packets with the controller to get an action for the matching packet. With FFS, however, switch C examines the IP option and, if the flag is 1, adds a flow entry with an action that forwards matching packets to port 2. Switch C then shifts IP option[1-5] one element right and forwards the packet to port 2. Switches D and E handle the packet the same way switch C processed it, so the flow is successively set along the path ACDE and the packet is delivered to host b.

The failure recovery time of FFS is necessarily longer than that of protection because protection does not have to communicate with the controller. However, FFS remains applicable when the backup paths of protection are not available.

4.3 Prototype Implementation

To validate the proposed control loop and the FFS algorithm, we implemented a prototype that manages failures in OpenFlow networks. Our testbed consisted of six

OpenFlow switches, 30 hosts, and a controller, as shown in Figure IV.6. The testbed was a typical OpenFlow network in which switches are controlled by a centralized controller. A solid line represents a link between switches and hosts, and a dashed line represents a secure connection between the controller and a switch; only control packets are exchanged over this connection. We used Floodlight [44] as the centralized controller of our OpenFlow network testbed because it is one of the most popular OpenFlow controllers. The FFS algorithm was implemented in each OpenFlow switch by modifying Open vSwitch 1.9.0 [45].

We refer to the implementation of the cognitive control loop as CogMan. CogMan is implemented as a Floodlight module like the other basic modules and uses other modules for fault management. For example, CogMan obtains topology information from a topology manager module to infer a correlated link or port when a certain switch port is down. A routing module is used to find available and alternative routes. We also made modifications in order to compare CogMan with other existing methods, and implemented protection and restoration using the FOCALE control loop. In FOCALE, a single failure and multiple failures are handled by protection and restoration, respectively. We also implemented MAPE; however, there is no significant difference in performance between MAPE and FOCALE. Therefore, we compare only FOCALE and CogMan in this paper.

We built the testbed with two machines. The controller was implemented on one machine and the OpenFlow network was emulated on the other machine using Mininet [46]. These two machines were connected via Ethernet through a switch. We extended the Mininet software to add functionalities such as injecting link down failures to target links and measuring end-to-end delay.
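The pad-field encoding that the modified switches rely on can be illustrated with a small simulation. The following Python sketch is not the Open vSwitch modification itself; it only mimics the layout of Figure IV.4 (flag in element 0, next hop in element 5) and the consume-and-shift step of Algorithm 3. The function names `encode_path` and `consume_hop`, and the use of 0 as the NULL marker, are assumptions made for illustration.

```python
PAD_LEN = 6   # ofp_action_output.pad has six one-byte elements
NULL = 0      # assumption: 0 marks an empty slot (no output port)


def encode_path(out_ports):
    """Encode the output ports of the remaining hops into a pad-style list.

    out_ports lists the output port of each remaining switch in path order.
    Element 0 is the FFS flag; the next hop is stored in element 5 and later
    hops are stored toward element 1 (reverse order).
    """
    if len(out_ports) > PAD_LEN - 1:
        raise ValueError("at most five remaining hops fit into pad[1]-pad[5]")
    pad = [NULL] * PAD_LEN
    pad[0] = 1  # flag: this output action carries a path for FFS
    for i, port in enumerate(out_ports):
        pad[PAD_LEN - 1 - i] = port
    return pad


def consume_hop(option):
    """Mimic a switch without a matching flow entry.

    Returns (output_port, updated_option). The switch reads element 5,
    then shifts elements 1-5 one position to the right.
    """
    if option[0] != 1 or option[PAD_LEN - 1] == NULL:
        return None, option  # no FFS path left; the switch would ask the controller
    out_port = option[PAD_LEN - 1]
    shifted = [option[0], NULL] + option[1:PAD_LEN - 1]
    return out_port, shifted


if __name__ == "__main__":
    # Remaining path C-D-E of the example: output ports 2, 1, 2.
    option = encode_path([2, 1, 2])
    print("pad set by the controller:", option)
    for switch in ("C", "D", "E"):
        port, option = consume_hop(option)
        print(f"switch {switch}: forward to port {port}, option now {option}")
```

In this sketch the list handed from one call to the next plays the role of the IP option field carried by the first packet of the flow.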

[Figure IV.6: A CogMan prototype: fault management. The CogMan, FOCALE, and MAPE management modules (with protection and the FFS algorithm) run on the Floodlight controller core, which controls an OpenFlow network of switches S1-S6 and hosts 1 to n; the emulation adds topology construction and fault injection.]

4.4 Evaluation

We conducted various failure recovery experiments using the testbed shown in Figure IV.6. We compared the failure recovery time of CogMan and FOCALE. We also analyzed the recovery time for single and multiple failure cases. In addition, we measured the ratio of control packet exchanges that occurred during recovery. The control packet exchange ratio is also an important factor in OpenFlow networks because it directly affects the performance of the controller.

First, let us look at the results of the single failure recovery experiments when link(s3, s5) failed in our testbed. Before injecting the failure, hosts connected to switches sent ping messages to other hosts in order to set the flow tables of switches. Next, a number of hosts sent UDP packets to other hosts at 10 ms intervals. At approximately 0.7 s, we injected a single failure in link(s3, s5). We then measured the end-to-end delays of UDP packets and the number of dropped packets. We

calculated the failure recovery time by multiplying the number of dropped packets by the packet sending interval of 10 ms.

[Figure IV.7: Recovery time of UDP flow when a single failure occurs. Panels: (a) CogMan (protection), (b) Restoration. Axes: packet count (x), end-to-end delay in ms (y).]

Figure IV.7a shows the end-to-end delay between a host connected to switch 3 and a host connected to switch 5. When the flow was set in protection mode, the working path s3-s5 was assigned for the flow. At the same time, the backup path s3-s2-s5 was also set to the switches in the path. On detecting a link failure, the affected flows were redirected to the backup path. The link failed at 0.7 s and the recovery time was approximately 10 ms. Figure IV.7b shows the end-to-end delay and the recovery time of the restoration mechanism when the number of affected flows was 30. The recovery time was 100 ms, and the packet delay just after recovery was longer than that of the other packets. We surmise that this occurred because the processing load of the machine emulating the OpenFlow network increased as the controller sent excessive control messages to switches. As many hosts and switches reside in the same machine, the load of the machine affected the packet processing

performance of the switches.

We also conducted multiple link failure experiments. We injected failures on two links, link(s2, s3) and link(s3, s5). One link belonged to the working path and the other to the backup path. Because both working and backup paths are affected by failures, protection cannot handle them. FOCALE uses an existing restoration mechanism to recover when protection does not work. CogMan uses the FFS algorithm for recovery. In a multiple failure scenario, FOCALE and CogMan have in common that the controller redirects all the affected flows in succession and that the recovery time is proportional to the number of affected flows.

We measured the recovery time against an increasing number of affected flows, as shown in Figure IV.8. The x-axis shows the number of affected flows, while the y-axis shows the minimum recovery time, the maximum recovery time, and the average recovery time. The minimum recovery time is the time spent to recover the first affected flow. The maximum recovery time is the time for the last affected flow. The average recovery time is the expected time for any flow to be recovered after failures. The average recovery times of CogMan and FOCALE were proportional to the number of affected flows. When the number of affected flows was 100, the maximum recovery time for FOCALE was 813 ms. If the number of flows in a network is large, restoration is not acceptable for managing failures. The failure recovery time of FFS, which is used in CogMan, was approximately 45% shorter than that of restoration, irrespective of the number of affected flows. If the path length is greater, the time difference will increase because, in restoration, the controller sends N packets for every affected flow when the length of the path is N. In this experiment, the path length was set to 3. Figure IV.9a shows the number of packets exchanged per second during failure

[Figure IV.8: Recovery time (multiple failures). Series: minimum, average, and maximum recovery time for CogMan and FOCALE. Axes: number of affected flows (x), recovery time in ms (y).]
[Figure IV.9: Control packet exchange ratio and traffic generated during failure recovery (multiple failures). Panels: (a) Control packet exchange ratio in pps, (b) Control traffic in KBps, each for FOCALE and CogMan against experiment time in seconds.]

recovery when the number of affected flows was 50. The links failed at about 0.2 s, and port down alarms were sent to the controller from switches at 0.3 s. FOCALE generated excessive control packets during failure recovery. The packet exchange ratio increased up to 820 pps, and then the ratio remained between 590 and 990 pps before the controller redirected all affected flows. CogMan exchanged fewer control packets than FOCALE. The packet exchange ratio of CogMan increased to 360 pps, and then remained between 340 and 360 pps. An interesting aspect is that the maximum packet exchange ratio of FOCALE was greater than that of CogMan. CogMan gets a path for every affected flow, implants the path into the flow entry message, and then sends the flow entry to only one target switch, which is the first hop in the path. FOCALE, in contrast, gets a path for every affected flow and sends flow entries to the target switches one by one. The time gap between sending flow entries to the switches in the same path is very small. CogMan needs more time before sending a flow entry to a switch because it has to implant a path into the OpenFlow message.

Figure IV.9b shows the amount of traffic generated. The shape of the graph is similar to that of Figure IV.9a, which means that the size of the control packets does not vary much. The peaks of the control traffic of FOCALE and CogMan are and 28.8 s. The bandwidth required for failure recovery of CogMan was also about 70% lower than that of FOCALE.

Protection provides the fastest failure recovery compared with restoration and FFS. However, its communication overhead at flow setup time is much higher than that of the other mechanisms. Figure IV.10a shows the number of packet exchanges between switches and the controller at flow setup time. The x-axis is the number of flows requesting flow setup from the controller nearly simultaneously. As the number of requesting

flows increases, the number of extra packet exchanges also increases for both the normal and protection cases. Figure IV.10b shows the difference in the number of packet exchanges between normal and protection flow setup. Actual flow setup requires more packet exchanges than the analytical estimate, since the switches and controller generate additional packets as many new flows are established. Therefore, the overhead of the controller is not negligible if the network is busy.

[Figure IV.10: Number of required packet exchanges for flow setup. Panels: (a) Number of packet exchanges for flow setup (normal, protection), (b) Difference in the number of packet exchanges between protection and normal flow setup (analytical difference, measured difference). Axes: number of flow setups (x), number of packet exchanges (y).]

By conducting various failure recovery experiments, we have shown that the proposed control loop and the FFS algorithm provide faster failure recovery than existing methods. From our implementation and experimental experience, we discovered that the performance is not highly dependent on the control loop. For example, MAPE and FOCALE can also be implemented with the FFS algorithm to give similar results. However, to design and implement a management system with MAPE or FOCALE, operators must define policies for every situation. The cognitive control loop provides the reactive, deliberative, and reflective loops intrinsically and it helps

network operators to provide fine-grained management for various situations.

4.5 Chapter Summary

In this chapter, we applied the proposed architecture and algorithm to manage faults in SDN. The detailed processes for managing faults and the experimental results were described. Through emulation in our testbed, we showed that the proposed methods recover from multiple failures faster than the restoration mechanism.
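As a rough illustration of the message-count argument used in this chapter, the following Python sketch compares the number of controller-to-switch flow entry messages for restoration (one per switch on the new path, for every affected flow) and for FFS (one per affected flow). It covers only this simple analytical model; the function names are hypothetical, and the sketch says nothing about the measured recovery times, which also depend on processing load.

```python
def restoration_exchanges(num_flows: int, path_len: int) -> int:
    """Restoration: the controller sends one flow entry to each of the
    path_len switches on the new path, for every affected flow (M x N)."""
    return num_flows * path_len


def ffs_exchanges(num_flows: int) -> int:
    """FFS: the controller sends a single flow entry (to the first hop)
    for every affected flow (M)."""
    return num_flows


def min_redirect_time(path_len: int, t_send: float) -> float:
    """Lower bound on redirecting one flow with restoration when sending
    one flow entry takes t_send seconds (N x T)."""
    return path_len * t_send


if __name__ == "__main__":
    flows, hops = 50, 3  # values taken from the multiple-failure experiment
    print("restoration:", restoration_exchanges(flows, hops), "messages")
    print("FFS:        ", ffs_exchanges(flows), "messages")
```

With 50 affected flows and a path length of 3, the model gives 150 flow entry messages for restoration and 50 for FFS.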

Chapter V
Autonomic Fault Diagnosis based on Association Rule Mining

5.1 Introduction

There are two broad approaches to the design of the Future Internet: revolutionary and evolutionary [47, 48]. In either design, management of the Future Internet is one of the important topics. However, we do not have a clear picture of the Future Internet yet, and many emerging technologies are being investigated for it. For example, Content Centric Networking (CCN) is one of the hot issues, and network virtualization and autonomic networking will be key technologies for the Future Internet [49].

Even if a new Internet architecture replaces the current Internet architecture, the basic paradigm of network management will not change. The paradigm is to understand the current status of the network and take the appropriate actions. In order to understand the network status, we need to monitor network devices, links, and servers. Network administrators suffer from large numbers of network events and alarms. Enterprise networks generate millions of network alarms per day. In cloud computing or virtualized network environments, there will be even more network events and alarms to be analyzed. In addition to physical entities, alarms related to virtualized resources will be generated.

Existing rule-based and case-based alarm correlation approaches need manually defined rules and cases, based on the assumption that a managed network is stable. However, a dependency between alarms might be missing, and manual modification is necessary when a managed network changes. For example, if the topology of the managed network changes, the rules related to the topology must be changed manually. Therefore, it is necessary to update the dependency model through learning.

Alarms contain information about the serious status of network resources, such as links, routers, and switches. However, this fragmentary information does not tell the impact of a certain problem. Serious and urgent alarms need to be detected and processed more quickly than normal alarms. We propose an efficient fault management approach based on a cognitive control loop, which is a part of CogMan. The cognitive control loop determines the priorities of network alarms, processes alarms with three different control loops, and then infers root causes of the problems based on learning and reasoning.

In order to evaluate our approach, we synthetically generate alarms, then correlate and analyze them to find root causes. In addition, we propose an ontology for determining the priorities

of alarms. Urgent cases are treated immediately with specified actions. Otherwise, possible sets of actions are examined and the most appropriate one is selected. In our experiment, 16 different alarms are reduced to four clusters by using learned rules and our clustering algorithm. This means that the effort and time of higher-level network managers can be reduced.

5.2 Multiple Control Flows based on the Priorities of Alarms

The cognitive control loops process network events and alarms. At the same time, relationships between alarms are learned to adapt to changing environmental conditions. As shown in Figure V.1, multiple control loops are available based on the priority of an alarm. These control flows are mapped to the cognitive control loop in Figure III.2. In the observe phase, data is retrieved from the managed resource (e.g., by SNMP polling or traps). Vendor-specific data is translated to a normalized form based on the DEN-ng information model. Network alarms are filtered and correlated in order to efficiently find root cause alarms. In this phase, a dependency model is used to correlate alarms. At the same time, the normalized data is fed to the learning phase. Changing environmental conditions are captured by learning; in particular, relationships between alarms are detected to update the dependency model. After alarms are correlated in the normalize phase, the priority of each alarm is determined by classifying the alarm. An alarm is classified as urgent if it causes serious performance degradation of network resources or services. Alarm priorities are determined based on a policy. If an alarm

is urgent, a set of actions is sent to the network devices without passing through the plan and decide phases. This is the difference from the previous version of the FOCALE control loops. If the current state has a high priority, the loop skips the plan phase in order to take immediate actions. For a low priority alarm, the plan phase takes a high-level behavioral specification from humans and controls the system behavior in such a way as to satisfy the specification. That is, the plan phase computes all the possible sets of actions that change the current state to a desired state. The decide phase chooses the set of actions that maximizes a goal. Finally, the act phase sends commands for the chosen actions to the target network devices. Model-based translation converts device-neutral actions to device-specific commands.

[Figure V.1: Multiple control flows based on the control loop]

Network alarms are correlated in the normalize phase of the cognitive control loops. First, alarm information is extracted into the form shown in Figure V.2. Typically, a single failure affects other services and devices. Therefore, if a single failure occurs

somewhere in the network, many alarms related to the failure are generated. Once a fault occurs, many identical alarms are generated to report the fault until it is fixed. Those identical alarms are generalized as shown in Figure V.2. Alarm generalization reduces the number of alarms. Generalized alarms are then correlated to further reduce the number of alarms and to find root cause alarms. Alarm correlation depends on a basic dependency model, to which association rules detected in the learning process are added. As we mentioned, the learning process learns relationships between alarms.

[Figure V.2: Example of alarm generalization]

Figure V.3 shows a basic dependency model, which is manually defined. It is based on the TCP/IP model. A lower layer problem affects higher layers. For example, if a server link is down, the IP layer is also unavailable. At first, the manually defined dependency model is used. Additional rules learned from association rule mining are added to the basic dependency model to adapt to changing environmental conditions.

We now describe how to build a set of clusters based on the association rules. Initially, alarms are generalized and grouped by alarm ID. That is, each alarm is a single cluster by itself in the initial phase. Then, all the association rules are

examined one by one and the corresponding clusters are merged. In this way, each cluster contains both alarms and relationship information. Therefore, root cause alarms can be analyzed easily. Based on the relationships between alarms belonging to the same cluster, root causes can be inferred.

[Figure V.3: Example of a dependency model]

5.3 Association Rule Mining

We can use various machine learning techniques for inference. For efficient alarm correlation, it is extremely important to find relationships between alarms. In this paper, association rule mining is used to find cause and effect from the relationships between alarms.

Table V.1: Alarm transaction data set

TID   Transaction item set
1     A1, A2, A4, A5
2     A1, A4, A5
3     A2, A3, A4, A5
4     A1, A2, A4
5     A1, A3, A5
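The two standard measures of association rule mining can be checked by hand on this small data set. The sketch below is plain Python rather than the Weka-based implementation used in the evaluation, and the names `TRANSACTIONS`, `support`, and `confidence` are chosen here for illustration. Support is taken as the percentage of transactions that contain an item set, and the confidence of a rule X ⇒ Y as the support of X ∪ Y divided by the support of X.

```python
# Transactions of Table V.1, keyed by transaction ID.
TRANSACTIONS = {
    1: {"A1", "A2", "A4", "A5"},
    2: {"A1", "A4", "A5"},
    3: {"A2", "A3", "A4", "A5"},
    4: {"A1", "A2", "A4"},
    5: {"A1", "A3", "A5"},
}


def support(itemset):
    """Percentage of transactions that contain every item of itemset."""
    items = set(itemset)
    hits = sum(1 for t in TRANSACTIONS.values() if items <= t)
    return 100.0 * hits / len(TRANSACTIONS)


def confidence(x, y):
    """Confidence (in percent) of the rule x => y."""
    return 100.0 * support(set(x) | set(y)) / support(x)


if __name__ == "__main__":
    print("sup(A1)       =", support({"A1"}))
    print("sup(A1, A4)   =", support({"A1", "A4"}))
    print("conf(A1=>A4)  =", confidence({"A1"}, {"A4"}))
```

For Table V.1 this gives a support of 80% for A1 and a confidence of 75% for the rule A1 ⇒ A4.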

The transaction database is built from the alarm data of the managed network after the pretreatment, as shown in Table V.1. Each transaction in the database has a unique transaction ID and contains a subset of the items. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The Apriori association rule algorithm basically has two steps: the first is finding all frequent item sets in a data set by applying min sup (the minimum support threshold); the second is generating association rules based on the frequent item sets. For any item set X, the support of X, sup(X), is defined as the proportion of the transactions in the data set that contain the item set, as given in Equation V.1. In Table V.1, the support of A1 is 4/5 (80%). We assume that the default value of min sup is 10%, so the support of A1 is greater than min sup. Therefore, rules related to A1 should be found.

$$\mathrm{sup}(X) = \frac{\mathrm{count}(X)}{|\mathrm{transactions}|} \times 100\,(\%) \tag{V.1}$$

The confidence of the rule X ⇒ Y is defined in Equation V.2. In Table V.1, conf(A1 ⇒ A4) is sup(A1 ∪ A4)/sup(A1) = 75%. Frequent item sets and the minimum confidence constraint are used to form rules.

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{sup}(X \cup Y)}{\mathrm{sup}(X)} \times 100\,(\%) \tag{V.2}$$

5.4 Determination of Alarm Priorities

One of the most important features of the cognitive control loops is that alarms are handled differently based on their priorities. Urgent alarms can be processed faster than normal alarms. Classifying urgent alarms is dependent on a goal and policy of a

network. We defined the ontology based on the DEN-ng information model to make effective semantic representations of network elements, alarms, and their priorities. Figure V.4 describes the concepts of network elements and alarms used for determining their state and priorities. An element is a network resource that has its own state, such as CPU utilization, link throughput, etc. An element provides services and notifies its state to a network administrator. A notification can be an alarm or an event. An alarm has a destination, source, and type, as described in Figure V.2. Alarms are classified into three classes: urgent, high priority, and low priority. A service has three classes: gold, silver, and bronze. A gold service is the most important service. The three alarm classes are defined based on the policy of a managed network. For example, a network administrator can define that an alarm is classified as urgent if it leads to a Service Level Agreement (SLA) violation of a gold service.

We use the Semantic Web Rule Language (SWRL) [50] to add conditional rules to the ontology. We assume that alarms related to a gold service are urgent and that a gold service is provided by the server WS2 in Figure V.5, so alarms related to WS2 are classified as urgent. For example, WS2 HTTP unavailable, WS2 IP down, and WS2 port down are urgent alarms that need to be fixed as soon as possible. The following SWRL rules classify alarms based on our assumption. These SWRL rules classify an alarm as urgent if it concerns an element that provides a gold service.

Alarm(?a) ∧ hasalarmdst(?a, ?dst) ∧ Element(?dst) ∧ providesservice(?dst, ?s) ∧ GoldService(?s) → UrgentAlarm(?a)   (V.3)

Alarm(?a) ∧ hasalarmsrc(?a, ?src) ∧ Element(?src) ∧ providesservice(?src, ?s) ∧ GoldService(?s) → UrgentAlarm(?a)   (V.4)

[Figure V.4: Ontological model for alarms and priority]

5.5 Evaluation

In this section, we describe the evaluation and its results for validating our proposed approach. We implemented the alarm correlation algorithm in Java and used the Weka [16] library for association rule mining. We generated synthetic alarm

data sets for the experiment, which correlates alarms and finds root causes.

Figure V.5: Experimental Topology

Figure V.5 shows the experimental topology, which is composed of 22 nodes and has four types of critical alarms: IP down, link down, port down, and router down. These alarms occur randomly at designated nodes, as shown in Figure V.5. The default route from R6 to R1 passes through R3-R0-R1. If the link between R3 and R0 is down, N4 cannot connect to WS1 or FS2. We generated two synthetic alarm data sets, one for training and one for validation. For example, the link of router R0 is down from second 5 to second 15, and the port of WS1 is down from second 20 to second 30. If WS1 port down occurs, N1 and N8 generate a WS1 HTTP unavailable alarm. We assumed that N1 and N8 periodically poll the states of all the servers in the network.

Based on the generated synthetic alarm data set, the cause and effect relationships are detected. Table V.2 shows some of the alarms, their alarm IDs, and the detected association rules. The rules A3 → A4 and A3 → A6 mean that the services on WS1 and FS2 are unavailable if the R0-R3 link goes down. Based on the rules, when alarms A3, A4, and A6 are generated, A3 is identified as the root cause alarm. There are thousands of alarms in enterprise networks [51], so a large number of rules can be detected.

Table V.2: Alarms and detected association rules

Alarm ID   Alarm                      Rule
A3         R0-R3 link down            A3 → A4
A4         N4 WS1 HTTP unavailable    A3 → A6
A5         N1 WS1 HTTP unavailable    A8 → A6
A6         N4 FS2 FTP unavailable     A9 → A6
A7         N4 WS2 HTTP unavailable    A3 → A7
A8         N1 FS2 FTP unavailable
A9         FS2 IP down

Next, all the critical alarms described in Figure V.6 are generated simultaneously. Figure V.6 shows the type and number of the generated alarms. Node N1 generates WS2 HTTP unavailable and FS2 FTP unavailable alarms. However, if a fault is not fixed within the time window, N1 generates the same alarms repeatedly, so the alarm stream includes redundant and similar alarms. For example, node N1 generates five WS2 HTTP unavailable and five FS2 FTP unavailable alarms. These identical alarms are generalized as explained in Section 4, so a higher-level manager receives a reduced number of alarms. In our experiment, the generalization process reduces 100 alarms to 22 alarms. However, alarms generated by nodes N6 and N8 are not received because of the failure of R4.
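The root cause identification described above for alarms A3, A4, and A6 can be illustrated with a short Java sketch. It is a simplified reading of the idea, not the correlation algorithm implemented in this work: an active alarm is treated as a root cause if no rule whose cause is also active points to it. The class `RootCauseFinder`, the `Rule` type, and the method `findRootCauses` are hypothetical names introduced for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only: picks root cause alarms using mined "cause -> effect" rules. */
class RootCauseFinder {
    /** A mined association rule "cause -> effect" between two alarm IDs. */
    static final class Rule {
        final String cause;
        final String effect;

        Rule(String cause, String effect) {
            this.cause = cause;
            this.effect = effect;
        }
    }

    /**
     * Returns the active alarms that are not explained by another active alarm.
     * An alarm is "explained" when some rule has it as the effect and the rule's
     * cause is active as well.
     */
    static Set<String> findRootCauses(Set<String> activeAlarms, List<Rule> rules) {
        Set<String> explained = new HashSet<>();
        for (Rule rule : rules) {
            if (activeAlarms.contains(rule.cause) && activeAlarms.contains(rule.effect)) {
                explained.add(rule.effect);
            }
        }
        Set<String> roots = new TreeSet<>(activeAlarms); // sorted for stable output
        roots.removeAll(explained);
        return roots;
    }

    public static void main(String[] args) {
        // The two rules discussed in the text: A3 -> A4 and A3 -> A6.
        List<Rule> rules = List.of(new Rule("A3", "A4"), new Rule("A3", "A6"));
        System.out.println(findRootCauses(Set.of("A3", "A4", "A6"), rules)); // prints [A3]
    }
}
```

A limitation of this simplification is that rules forming a cycle would leave no root cause at all, so a practical implementation would need an additional tie-breaking step.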

Figure V.6: Synthetically Generated Alarms from Each Device

Figure V.7 shows the output of our clustering algorithm. Even though the total number of alarms is still large, we can easily find the root cause alarm in each cluster. For example, cluster 1 consists of R0 link down, WS1 HTTP unavailable, and FS2 FTP unavailable alarms. Based on the association rules in Table V.2, we can infer that the root cause alarm is R0 link down. Cluster 4 consists of WS1 port down and WS1 HTTP unavailable alarms. Network administrators can therefore focus only on the root cause alarm of each cluster. Figure V.7 also shows the number of clusters and the alarms in each cluster. The root cause alarm of cluster 1 is R0-R3 link down. Cluster 1 also includes the other alarms caused by R0-R3 link down, such as N4 WS1 HTTP unavailable, N4 WS2 HTTP unavailable and N4 FS2 FTP