Decision Support Systems: Diagnostics and Explanation Methods in the Context of Telecommunication Networks


Decision Support Systems: Diagnostics and Explanation Methods
In the context of telecommunication networks

Martin Lindberg

Decision Support Systems: Diagnostics and Explanation Methods
In the context of telecommunication networks

Master of Science Thesis
Martin Lindberg

Supervisor at Ericsson: Stefan Avesand
Examiner: Thomas Hellström

Department of Physics
Umeå University
Umeå, Sweden
April

Abstract

This thesis work, conducted at Ericsson Software Research, aims to recommend a system setup for a tool that helps troubleshooting personnel at network operation centres (NOC) who monitor the telecom network. The thesis examines several different artificial intelligence algorithms and concludes that Bayesian networks are suitable for the intended system. Since the system will act as a decision support system, it needs to be able to explain how its recommendations were derived. Hence a number of explanation methods have been examined. Unfortunately no satisfactory method was found, and thus a new method was defined, the modified explanation tree (MET), which visually illustrates the variables of most interest in a so-called tree structure. The method was implemented and, after some initial testing, has received positive first feedback from stakeholders. The final recommendation therefore consists of a system based on a Bayesian model trained on data previously collected from the domain. The user obtains recommendations for the top-ranked causes and afterwards gets the option of further explanation regarding a specific cause. The explanation aims to give the user situation awareness and help him/her in the final action to solve the problem.

Sammanfattning

This master's thesis, carried out for Ericsson Software Research, aims to produce a recommendation for the development of a support tool for troubleshooting personnel at Ericsson's network operation centres. The report examines different types of machine learning algorithms and concludes that Bayesian networks are suitable for the intended system. Since the system is to serve as decision support, it needs to explain how the recommended solution was derived. Therefore several types of explanation methods have been examined. Unfortunately none of the examined algorithms was suitable. Consequently, a new explanation method was defined, the modified explanation tree (MET), which gives a visual picture in which the variables of interest from the domain are shown in a so-called tree structure. After implementation and initial trials, the new method has been met with positive feedback from the commissioner of the work. The final recommendation therefore consists of a Bayesian network where the training data has been collected beforehand from the domains in question. For newly arrived error reports, the user then receives recommendations of the highest-ranked causes. Optionally, the user can then request an explanation of the current recommendation and form an understanding that leads him/her to the final action to solve the problem.

Acknowledgement

Daniel Schiess, thesis worker from the Royal Institute of Technology
Stefan Avesand, Ericsson, Senior Software Researcher

Table of Contents

1 Introduction
    1.1 The Problem
        Problem definition
        Limitations
    1.2 Layout
    1.3 Collaboration
2 Background
    2.1 Workflow Engine overview
    2.2 Smart workflows (SM)
        A simple example - IPTV
            Detection
            Analyse
            Plan
            Act
            Feedback
    2.3 Network management - how the NOCs work today
        What is a NOC?
        Structure of a typical (G)NOC
        Troubleshooting by FO in a NOC today
        Improvements
3 Theory
    3.1 Algorithms overview
        Decision trees (DT)
        Bayesian network (BN)
        Support vector machines (SVM)
        K-Nearest Neighbours (KNN)
        Summary of algorithms
    3.2 Evaluation method
    3.3 Probabilistic reasoning

    3.4 Bayesian Networks
        Naïve Bayesian Network
    3.5 Explanation methods and properties
        What is an Explanation
        Important Perspectives of an Explanation
        Explanation methods
    3.6 Feedback
4 Method
        Literature analysis
        Experimentation
5 Experimentations
        Evaluation of classification algorithms
        Connection R with Activiti
        Naïve Bayes Classifier Experiment
        Modification of Explanation Tree
        An Implementation of Explanation Methods in Matlab
6 Conclusion and discussions
        Discussions regarding problem statements
        Recommended system
        Future works
7 Bibliography
8 Appendices
    8.1 Explanation tree
    8.2 R server, Activiti and a standalone DB
    8.3 Additional figures

List of acronyms and abbreviations

AI      Artificial Intelligence
ACS     Automatic Configuration Server
BBN     Bayesian Belief Network
BN      Bayesian Network
BO      Back Office
BPEL    Business Process Execution Language
BPMN    Business Process Modelling Notation
BRAS    Broadband Remote Access Server
BSS     Business Support System
CLI     Command Line Interface
CPT     Conditional Probability Table
DAG     Directed Acyclic Graph
DAPA    Detect, Analyse, Plan, Act
DSLAM   Digital Subscriber Line Access Multiplexer
DT      Decision Trees
EM      Expectation-Maximization
FO      Front Office
FOPS    Field Operations
HW      Hardware
KNN     K Nearest Neighbour
MET     Modified Explanation Tree
NBC     Naïve Bayes Classifier
NBN     Naïve Bayes Networks
NCS     Network Customer Support
NE      Network Element
NMS     Network Management Solution
NOC     Network Operation Centre
OSS     Operation Support System
PDU     Project Development Unit
RGW     Root Gateway
SLA     Service Level Agreements
SOP     Standard Operating Procedure
STB     Set-Top-Box
SVM     Support Vector Machine
SW      Software
TAN     Tree Augmented Naïve Bayesian classifier

List of figures

Figure 1: Short example of a day-to-day routine modelled in a workflow
Figure 2: Image showing the concept of smart workflows
Figure 3: Serial connection
Figure 4: Diverging connection
Figure 5: Converging connection
Figure 6: Made-up example with inspiration from [25]
Figure 7: Naïve Bayes model
Figure 8: Made-up example with inspiration from [25]
Figure 9: Illustration of the process to find the best explanation
Figure 10: Workflow set up in Activiti with mocked output from the R servers to give an intuition of how the workflow may look
Figure 11: Number of TP and FN, i.e. passengers that were classified as alive and really survived, and classified as dead and really did not survive
Figure 12: Mean accuracy for repeated training of the NBC 5 times, each time with random order of the training data; lines indicate the 98% confidence interval, blue the lower limit and red the upper
Figure 13: Mean accuracy for repeated training of the NBC 100 times, each time with random order of the training data; lines indicate the 98% confidence interval, blue the lower limit and red the upper
Figure 15: Modified explanation tree with and. Blue lines indicate possible states and red lines the evidence path in the tree. The variable representing the node is written above the node circle and the state of the branch is written below the end of its line. The likelihood of the specific classification is written after each split and indicates the new likelihood with the evidence included

Figure 16: Modified explanation tree with and
Figure 17: Modified explanation tree with and
Figure 18: Modified explanation tree with and
Figure 19: Modified explanation tree with but with no threshold level

List of tables

Table 1: Information on different tasks in NOC departments
Table 2: Comparison of classification algorithms
Table 3: CPTs for the made-up example illustrated in
Table 4: CPTs for the made-up example illustrated in Figure
Table 5: Confusion matrix from the naïve Bayes classifier with the Titanic data, 2/3 used as training data and 1/3 as validation data
Table 6: Likelihoods calculated with the NBC for the different evidences
Table 7: MPE explanation filling in the most probable missing evidence

1 Introduction

This report concerns decision support for network operators dealing with troubleshooting in telecommunication networks. The aim is to suggest a method for the development of an analysis tool that gives the user decision support in the form of suggestions of possible causes of an error. The report was requested by Stefan Avesand, senior researcher at Ericsson Software Research in Kista, Stockholm, in December. The work started on 19 February. The decision support tool will be part of a larger system known internally at Ericsson as Smart Workflows. The work is done as part of a master's thesis conducted at Umeå University (UMU).

1.1 The Problem

In pace with the rapidly expanding telecommunication industry and its accessibility, the need for network management increases. As of today, the network requires constant surveillance from operators to keep the communication flow running. This requires highly skilled operators who monitor the systems and solve the day-to-day problems quickly and efficiently. The operators need to handle network management, troubleshooting and maintenance. They need to schedule maintenance and plan error correction into the time schedule of field technicians, and at the same time analyse incoming alarms and tickets. To meet the increasing demands, efficiency needs to increase, and automation of some parts would help the operators in their daily work.

The telecommunication network includes several thousand hardware products, and most are connected to surveillance systems that detect abnormal behaviour, so-called anomalies. When an anomaly is detected, an incident management system, often referred to as a trouble ticket system, is activated and the operators are assigned the task of finding the cause of the problem and developing a solution. This system relies on the expertise of the operators to solve the problem.
This can be very time-consuming, and thus a system that accelerates the process would be both helpful and cost-effective. Such a system needs to collect and process immense amounts of data originating in the telecommunication network, as well as being able to extract expert knowledge from operators. Using the collected information introduces uncertainties into the system that have to be handled by some type of probabilistic theory. Since the system is intended to give the operator decision support, the ability to explain what influenced the recommendation is crucial. Thus the system needs to be able to explain

the output from the algorithm and point out the main features that led the system to its recommendation. The system will also have to use algorithms that enable reinforcement from feedback loops, which allows the system to improve its recommendations.

Some of the difficulties associated with analysing the data received from the telecom market are the large amount of data, the mixing of continuous and discrete data, uncertainties in the data and the difficulty of comprehending the whole picture. [1] Another challenge in the area is the current setup at the NOCs today. More than 40% of mobile traffic uses Ericsson networks [2], but operators are usually employed to monitor a whole network consisting of different hardware from separate suppliers, each with its own way of storing data and its own formats for handling troubleshooting. The multiple tools required to monitor the system are hard to learn and not very intuitive. The learning curve for newly employed operators is steep, and any system that shortens this process, and at the same time helps experienced operators with the daily work, will be helpful.

Problem definition

The aim of this thesis is to investigate suitable algorithms for diagnosing anomalies in telecom network data and recommending actions to an operator that solve the problem. It will also investigate how to present a description explaining what variables influenced the recommendation. The thesis will also investigate how such a system should get feedback and use it to reinforce the algorithm to make better recommendations in the future. The objectives of the thesis are:

- Identify possible algorithms for a decision support system based on state-of-the-art solutions.
- Investigate how to handle uncertainties for the different decisions.
- Investigate explanation methods for the system to visualise the cause of the outcome.
- Examine possible feedback loops for future improvements.
To make an analysis agent that produces recommendations for the smart workflow, the underlying algorithms need to be of prediction/classification type. Based on the data originating from the telecom network, the algorithms also need to handle mixed data and large datasets. Since a crucial part of this thesis is to make the operator aware of the key aspects that led to the recommendations, the algorithms need to be traceable and easy to visualise. Additionally, the algorithms need to handle uncertainties in some way and be adaptive towards feedback.

Limitations

This thesis will try to investigate the most promising methods for a recommendation support system based on state-of-the-art solutions used today, and also try to put these together in the context of telecommunication. Thus the aim is to combine already known algorithms, but in a new context. The limited time will restrict the extent of the literature study, and only the most promising elements will be investigated. Also due to time shortage, this thesis will not include any investigation of how to evaluate different algorithms for building a Bayesian network, nor a discussion of discretisation of continuous data.

1.2 Layout

This thesis consists of eight sections and is structured as follows. Section 2 consists of background material and aims at giving the reader a basic understanding of what a workflow is, what Ericsson defines as a smart workflow and what network management, in the context of a network operation centre (NOC), is. Section 3 is a theory section, based on the literature study made during the thesis work. The section starts with an overview of possible classification algorithms and an evaluation based on the criteria stated in the problem definition. The section continues with a description of an evaluation tool and a review of what probabilistic reasoning is. Next is an extended description of what a Bayesian network is, followed by related explanation methods. Last in this section, possible feedback methods are listed. Section 4 describes the method by which this thesis work has been conducted. Section 5 contains each experimental step made during the work, which led to the final recommendation of a possible decision support system. Section 6 includes conclusions and some discussions.
It starts with a quick discussion of the problems stated in the problem definition. Afterwards the recommended decision support system is described, and last an overview of suggestions for future work is listed. Section 7 is a bibliography.

Section 8 contains appendices with additional material referred to in the text.

1.3 Collaboration

This report was written in close collaboration with another master's thesis, Decision Support Systems: Workflows, by Daniel Schiess, a student from Kungliga Tekniska Högskolan (KTH), which was conducted during the same period at Ericsson. Due to this, some of the following sections may have similarities with it. Among the sections referred to are 2.3 and 5.2.

2 Background

This section consists of a brief introduction of aspects related to this thesis, such as workflows, smart workflows and network management. The section will start by explaining what a workflow is and what the research department at Ericsson means by smart workflows, of which the analysing system will be a part. Afterwards follows a brief explanation of how network management works today at the Network Operation Centres (NOC).

2.1 Workflow Engine overview

In all normal day-to-day activities, most people use implicitly defined workflows. Take for example getting ready for work in the morning. You follow some sort of schedule for when to wake up, so that you have time to eat breakfast and catch the bus, and thus make it in time for work. These flows of events are obviously based on previous experience and knowledge about your surroundings. This can be illustrated by a workflow, see Figure 1 below. For a network operator in the telecom market, network management, troubleshooting and maintenance are all part of the normal day-to-day work. Modelling workflows with the ability to capture flows of events, actions and decisions can prove useful for automation.

Figure 1: Short example of a day-to-day routine modelled in a workflow

Formal languages, such as Business Process Modelling Notation (BPMN) or Business Process Execution Language (BPEL), make it possible to describe and design complex structures using workflows. A traditional workflow usually consists of the following elements [3] [4] [5]:

1. Events that can trigger both the start and stop of parts of the workflow
2. Tasks to be performed by an agent
3. Agents that perform the activities within the workflow, be it a human or a computer
4. Connections and gateways between the elements, with dynamic expressions that are evaluated before or after an activity to determine the continued flow
5. Roles that group agents together, making it possible to assign tasks to groups instead of individuals, e.g. Administrators or Management

Necessary steps in creating, executing and working with workflows are as follows [5]:

1. Modelling: In this step, the modeller needs to outline the flow. What activities to include and how to optimise the work towards the goal must be considered.
2. Planning: Choose what algorithms to use where in your workflow, entities to act in it, and dependencies on other parts or systems.
3. Execution/Enactment: External events, timers or changes in external resources can trigger workflows.

Some authors use a fourth step, where humans evaluate the outcome of the workflow [3] [6].

2.2 Smart workflows (SM)

By combining traditional workflows with modern artificial intelligence, one can make workflows more dynamic and resilient in the face of uncertainty, as well as support the human operator more efficiently. By adding the possibility to adapt, change and deploy new sub-parts to the workflow, one can change it into a smarter one. Reinforcement learning can help the agents of the system improve the correctness of the system's predictions by adapting to continuously collected data.
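As a concrete illustration, the workflow elements listed above (events, tasks, agents, gateways) can be sketched as a minimal engine. This is an illustrative Python sketch only, not Activiti or any Ericsson tooling; all class and task names are hypothetical:

```python
# Minimal sketch of the workflow elements listed above. Illustrative
# only; names are hypothetical and not taken from any real workflow API.

class Task:
    def __init__(self, name, action):
        self.name = name
        self.action = action          # callable performed by an agent

class Workflow:
    def __init__(self):
        self.tasks = []               # ordered list of (task, guard)

    def add_task(self, task, guard=lambda ctx: True):
        # 'guard' plays the role of a gateway expression, evaluated
        # before the task to determine the continued flow
        self.tasks.append((task, guard))

    def run(self, context):
        # A triggering event starts the flow; each task may update the
        # shared context that later gateway guards are evaluated on.
        for task, guard in self.tasks:
            if guard(context):
                task.action(context)
        return context

wf = Workflow()
wf.add_task(Task("verify_alarm", lambda ctx: ctx.update(verified=True)))
wf.add_task(Task("escalate", lambda ctx: ctx.update(sent_to_BO=True)),
            guard=lambda ctx: not ctx.get("solved", False))
result = wf.run({"alarm_id": 42})
```

In this sketch a role could be modelled simply as a tag on the task telling which group of agents may execute it; the engine itself only walks the task list and evaluates the gateway guards.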
[7] In the third step above, during the execution of a workflow, Smart Workflows should be able to adapt to changes in their environment(s), and by using feedback loops they can reinforce themselves and improve future decisions. [7] The system should recognize when a new problem is encountered, one that does not have a (known) solution. Formalizing how new solutions look makes it possible to distribute new solutions to other

similar systems (e.g. sharing of knowledge between Network Operation Centres (NOCs)). Combining known solutions to form new solutions is possible once knowledge of how each solution affects the environment in which the system operates becomes available. [7]

A simple example - IPTV

A report from an Automatic Configuration Server (ACS) combined with other information shows that a set-top box (STB) is experiencing anomalous behaviour. This is the triggering event that initiates the SM shown in Figure 2 to move from the anomaly detection box. The system automatically calculates whether the current threshold is acceptable given the current load of the network, and then decides whether to act or not. The task can be considered solved once the system gets feedback confirming the success of the recommended solution. To reach the goal, the system will go through a series of workflow steps, namely [7]:

1. Detect: Is there a problem?
2. Analyse: What is the root cause?
3. Plan: How do we solve the cause of the problem?
4. Act: Assist the human operator during execution of the plan.
5. Feedback: Determine success or failure and improve the system by using this information in reinforcement algorithms.

The image in Figure 2 represents the flow of events.

Figure 2: Image showing the concept of smart workflows
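The five steps above form a loop that can be sketched as a skeleton. This is a hypothetical illustration of the Detect-Analyse-Plan-Act-Feedback idea, not Ericsson's implementation; thresholds, cause names and probabilities are made up:

```python
# Hypothetical skeleton of the Detect-Analyse-Plan-Act-Feedback loop
# described above. All handlers are illustrative stubs with made-up data.

def detect(report):
    # e.g. compare reported packet loss against a (here fixed) threshold
    return report["packet_loss"] > 0.05

def analyse(report):
    # return candidate root causes ranked by probability (made-up values)
    return [("misconfigured_RGW", 0.7), ("overloaded_DSLAM", 0.2)]

def plan(causes):
    # pick a remedy for the most probable cause
    best_cause, _ = causes[0]
    return "reconfigure:" + best_cause

def act(solution):
    # in reality this step assists a human operator; here we just succeed
    return {"solution": solution, "success": True}

def feedback(outcome, model):
    # reinforce: remember whether the chosen solution worked
    model.setdefault(outcome["solution"], []).append(outcome["success"])
    return model

model = {}
report = {"packet_loss": 0.08, "stb_id": "STB-1"}
if detect(report):
    outcome = act(plan(analyse(report)))
    model = feedback(outcome, model)
```

A smarter detector would, as the text notes, adapt the threshold to time of day and load instead of using a constant.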

Detection

Consider the first box, Anomaly detection, in the simple example with IPTV packet loss. Under normal circumstances, a threshold rule can be used to decide whether to act or not. In a more advanced system, you would expect the system to adapt its threshold level according to the time of day, the overall load on the system and other informative factors that have an impact on the warning you are trying to diagnose. A truly smart system should, after it has been active for some time, be able to predict how the line between erroneous and normal behaviour changes during the day [7].

Analyse

All that the system knows in this step is that something is wrong, along with the information that an error report can bring (tickets, automatic error logs, ID of the unit reporting the problem etc.). Suppose for example that the reported packet loss is due to misconfiguration in the customer's own equipment. The error, on the other hand, was reported from another instance, the ACS. The ACS told the system that an STB with a specific ID is experiencing packet loss. The first sub-step towards identifying any error is to gather information that can have an impact in said scenario. Here, the system will collect information from all the systems that are known to have an effect on the specific reported symptom (packet loss). In the simplified example provided here, this could mean gathering configuration parameters, state information of the network (load, peers, predicted usage etc.) and other related material. Regardless of what information the system uses, it is required that it learns how to pair certain types of warnings with certain information; this becomes crucial when the system tries to bind a solution to specific information (see the solution step). Human operators can be involved in this step as well, telling the system what sub-system to investigate and what information to include, helping the learning algorithms in the system to adapt even faster!
[7] A successful analysis will present to the user a list of probable causes of the problem, called root causes (each presented with its probability).

Plan

In the planning step, the system is supposed to know the cause of the symptoms. As the state of the system is defective, a path taking it from this state to a healthy state is needed. Each active change to the system affects the state of the network, so the question is whether it can find a path through the state space until it reaches a state considered healthy. Each step in this state space corresponds to a specific act, so the system needs to develop an advanced solution with many simple steps to reach the

healthy state. If the system is unable to find a path from the current state to a healthy one, one could argue that at least one of the following two possibilities is true [7]:

1. The system has never encountered this specific problem before, and therefore does not know how to fix it (solution missing, or more specifically, no solution is known to remedy the specific state of the network).
2. The system has a solution related to the symptoms/problems, but cannot move through the state space to a healthy one with that solution (applying the known solution leads to another predicted erroneous state).

After the operator chooses some action (which can be many steps in a single solution), resources allowing that solution to be performed need to be allocated. Some solutions will be purely automatic (e.g. change a setting somewhere), while others will need the assistance of a field technician. Every solution known to the system has information regarding time consumption, costs and personnel demands, so that it may recommend the solution that is best, seen from a cost/benefit maximization [7].

Act

During the execution of a solution, the state of the network should (once the solution has been deployed) change somehow. Information regarding the continued progress of a solution is fed back to the operators, and state information (graphs etc.) can show the operator how the applied solution affects the system [7].

Feedback

Normal feedback occurs once the task that needs feedback is complete. The feedback in a Smart Workflow will not be a single step, where the system becomes aware of the outcome of just one solution or prediction. Instead, feedback will be important in most steps, always training the system towards a better understanding of both the problems that can occur and how operators do their work [7].

This thesis aims at providing recommendations of algorithms and application methods for the analyse step in the smart workflow.
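The planning step described above, searching the state space for a path from a defective state to a healthy one, can be sketched with a breadth-first search. The states, actions and transition table here are entirely made up for illustration:

```python
# Minimal breadth-first search through a toy state space, sketching the
# planning step described above. States and actions are hypothetical.
from collections import deque

# toy transition model: state -> {action: resulting state}
transitions = {
    "packet_loss": {"reboot_STB": "still_lossy", "fix_RGW_config": "healthy"},
    "still_lossy": {"replace_STB": "healthy"},
    "healthy":     {},
}

def find_plan(start, goal="healthy"):
    # BFS returns the shortest sequence of actions, or None when the
    # problem is unknown to the system (case 1 in the list above)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for action, nxt in transitions[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [action]))
    return None

plan = find_plan("packet_loss")
```

Case 2 in the list corresponds to a transition table where every known action from the current state leads to another erroneous state, so the search exhausts the reachable states without hitting the goal and returns None.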

2.3 Network management - how the NOCs work today

The following sub-section is written in collaboration with Daniel Schiess, thesis student from KTH.

What is a NOC?

A network operation centre's (NOC) main tasks are to ensure service level agreements (SLA)¹ on availability and performance, provision customer network services on core equipment and provide support services for engineering and other technical teams. This results in a diverse range of duties for the NOC operators, such as network management, troubleshooting and maintenance scheduling [8].

Structure of a typical (G)NOC

A typical (Global) Network Operation Centre consists of mainly three parts: the Front Office (FO), the Back Office (BO) and Field Operations (FOPS), where the FO personnel are available 24/7 and the BO personnel work normal office hours. Both BO and FO interact with the surrounding world via different systems. FO usually interacts with some Network Customer Support (NCS) system belonging to a telecom operator, e.g. Telia or AT&T. The BO personnel work with field technicians (FOPS) and second-level troubleshooting, housing network experts and capabilities outside FO's scope [8]. The NOC as a whole has many different tasks; the table below describes which parts of a NOC deal with which problems. The table is not exhaustive, but serves as an example. [9]

Table 1: Information on different tasks in NOC departments

Process / Topic              Department    Sub-Department
Alarm Threshold              Back Office   TX (abbreviation?)
NMS Health Check Template    Back Office   NMS
Hardware Replacement         Back Office   TX

¹ SLA specifies to what degree the NOC is responsible for the quality of the network. It also includes demands on the network operators.

SOP for Equipment Configuration    Back Office       MPBN
BO BSS Routines                    Back Office       Core / BSS
Troubleshooting e.g. MLH           Field Operations  FOPS
External Alarm SOP                 Front Office      BSS
Maintenance Routines               Front Office      All
Incident and Problem Management    Front Office      All

Troubleshooting by FO in a NOC today

Troubleshooting in the FO today typically follows the scheme below [8]:

- Observe incoming alarms
- Verify alarm
- Check the originating element's status
- Fetch element information
- Follow different workflows to find and solve the alarm
- If no solution can be found, send the problem to BO

Improvements

From an internal document reporting the findings from a site visit in Bucharest, Romania, regarding product improvements, some suggestions were listed. The suggestions were related to user experience and product handling. The following aspects were pointed out [10]:

- Reduce software upgrades
  o Software upgrades are a frequent source of interruptions and outages. There should be at most 2 releases a year, since the upgrade procedure is very time-consuming. The quality and stability of the upgrade packages should be improved to reduce downtime.
- Improve alarm information and troubleshooting
  o Improve the notification and troubleshooting system, with integrated root cause information.
- Improve command line interface (CLI)
  o The CLI should cover all command options and allow for offline and batch configuration operations.
- Upgrade ServiceOn user experience

  o Improve adaptation to different user needs, troubleshooting functions and synchronization between different analysing tools.
- Graphical overview of network element (NE)
  o Graphical overview to guide field engineers on-site during repair and troubleshooting operations.
- Introduce customizable trouble tickets
  o A function to create a skeleton of a trouble ticket directly from the notification. This function could copy alarm information to the Windows clipboard and include hardware (HW) and software (SW) revision.
- Simplify field operations
  o Field engineers frequently need support on product handling during upgrades and related service operations, but recurring problems such as firewall access issues slow the work down.
- Alternative feedback channel to Project Development Unit (PDU)
  o Slow response when submitting a Customer Service Request. An informal way of reporting improvement suggestions would improve the PDU's knowledge of product handling.

These suggestions, especially the alternative feedback channel to the Project Development Unit, were the starting point that led to the idea of developing a smart system focused on dynamic feedback and reinforcement. Since it today takes a long time before improvement suggestions actually take form, the aim of this thesis is to find a system that automatically reacts to how the operators work while troubleshooting.

3 Theory

This section consists of the findings from the literature study conducted during the work. It starts with an overview of possible prediction/classification algorithms together with a summary of their properties (3.1). Next is a definition of evaluation tools for classification algorithms (3.2). After that there is a section about important properties when reasoning under uncertainty (3.3). Then there is a major section describing Bayesian networks, their properties, pros, cons etc. (3.4). Next in line is a part regarding explanation methods for Bayesian networks (3.5). Last, a feedback section presents possible solutions to training and adaptation (3.6).

3.1 Algorithms overview

This section contains a quick review of the most commonly used classification algorithms² that can be suitable for a support recommendation system. Last in the section there is a short summary of pros and cons of the algorithms, based on the conditions specified in the problem definition.

Decision trees (DT)

A decision tree is a graph-based model that consists of decision nodes, end nodes and so-called branches or leaves. The algorithm is constructed on the divide-and-conquer concept: at each decision node a choice needs to be taken, and the branches represent the possible alternatives. A branch leads to more decision nodes and finally ends in an end node. The set of alternative branches must be mutually exclusive, and all alternatives have to be represented. The starting node, the so-called root, begins the chain of events, which develops based on properties and attributes that can be retrieved from a database, in the form of historical observations or carefully selected tutorial examples created by an expert of the domain. [11] [12]

² Classification algorithms aim to categorise new observations into subgroups based on previous knowledge.
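C4.5-style tree growing, outlined below, ranks candidate splits by information gain. As a minimal sketch of what that criterion computes, with made-up nominal data:

```python
# Minimal sketch of the information-gain split criterion used by
# C4.5-style decision trees. Attribute names and data are made up.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label="class"):
    # gain = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[label] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [
    {"alarm": "high", "class": "fault"},
    {"alarm": "high", "class": "fault"},
    {"alarm": "low",  "class": "ok"},
    {"alarm": "low",  "class": "ok"},
]
gain = information_gain(rows, "alarm")   # a perfectly separating split
```

A tree builder would evaluate this gain for every candidate attribute at a node and split on the best one, then recurse on each resulting subset.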

There are many different standards for how to generate decision trees, but the most common one is C4.5 and its successors. Here is the outline of C4.5, which is the base for its successors. If S is a set of cases, C4.5 first grows the tree as follows [13]:

- If all the cases in S belong to the same class, or S is small, the tree is a leaf labelled with the most frequent class in S.
- Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition S into corresponding subsets according to the outcome for each case, and apply the same procedure recursively to each subset.

There are many different tests for how to choose the attribute in the last step. C4.5 uses two heuristic criteria to rank tests: information gain and gain ratio. [13]

Decision trees can handle both numerical and nominal data. More complex trees can be difficult to understand, but otherwise decision trees are usually referred to as white boxes due to their ability to show acquired knowledge in a readable form. The algorithm is also known to handle large data sets, and there are methods to validate the models statistically, making it possible to handle uncertainties. One of the major drawbacks of C4.5 is the CPU time and memory required. Improvements have been made in the successors, but it is still an issue. The algorithm also suffers from overfitting, which can be handled by either limiting the growth of the tree or pruning it. [12] [13]

Bayesian network (BN)

Bayesian networks belong to the family of probabilistic graphical models. A Bayesian network is a directed acyclic graph (DAG)³ consisting of nodes and edges/arcs. The nodes correspond to variables with a finite number of states, mutually exclusive and exhaustive. The edges represent causal one-way dependencies between variables.
[14] [15] [16] The graph captures the relationships between variables, identifies direct dependencies, and only the local neighbourhood of a node needs to be updated when a change in the system is made. BNs have a qualitative part and a quantitative part: the architecture of the network is the qualitative part, and the beliefs about the states are the quantitative part. Thus BNs are a combination of probabilistic reasoning, graph theory, computer science and statistics. [17] A more thorough description of the method can be found in section 3.4.

³ A DAG is a set of nodes with directed edges/arcs in between and no directed cycles.

BNs can handle both numerical and nominal data, and due to their probabilistic base they handle inherent uncertainties well. [15] Since the network beliefs can be provided by experts or from statistical data, BNs can be modified by their users and thus become more interactive. [17] BNs avoid overfitting and can handle incomplete data sets. The network is also transparent/traceable due to its graphical nature and is thus regarded as a white box. The major drawback is that the networks have problems with highly complex systems, which require huge amounts of data. [16] One way to avoid complexity is to redesign the model into a naïve Bayes (NB) architecture.

Naïve Bayes Networks (NBN)

A naïve Bayes network is a special case of BNs with naïve independence assumptions, which imply that all attributes in the training data are independent. [16] Even in cases when this condition is not fulfilled, which is most cases, the algorithm performs surprisingly well. The NB network is regarded as robust, transparent, easy to set up, good at handling large data sets, able to handle mixed data, and fast to adapt to changes in data. [13] [16]

Support vector machines (SVM)

SVM is a group of algorithms that are used for classification and regression. The SVM classifier iteratively constructs hyperplanes that separate the training data. The hyperplanes are chosen so that the margins between the classes are maximized; the larger the distance between the groups, the smaller the error of the classifier. [13] [16] The hyperplanes can be of different types, where linear planes are the lowest in complexity. If a set of observations is not separable in its original space, the observations can be mapped to a high-dimensional feature space to enable the separation to be performed.
[18] The mapping or transformation process is usually referred to as kernelling. There are many different kernel types, each with its own specific properties. The original SVM algorithm had problems regarding noise, overfitting and errors in the training data, but later additions have solved many of these problems, and today it provides one of the most robust and accurate methods among the well-known algorithms. [13] [19] SVMs are good at handling high-dimensional data and can handle missing values with external algorithms, but discrete data needs to be rescaled before use. [13] [20] SVMs have a clear connection to statistical learning theory and can thus handle statistical uncertainties. [20] For models of higher complexity it becomes essentially impossible to understand how the model made its classification, and such models are regarded as black boxes. [19] [21] The two biggest drawbacks of SVMs are CPU time and memory consumption, and that the choice of kernel has to be specified by the user. [20]

K-Nearest Neighbours (KNN)

One of the simplest, most straightforward classifiers among the machine learning algorithms is the k-nearest neighbour (KNN) classifier. The basic idea of KNN is to identify the nearest neighbours of an observation in the sample space created by the training set. [22] Thus the algorithm consists of two steps: finding the nearest neighbours and determining their classes. Because the classification is based entirely on previous observations, the algorithm is sometimes called memory-based classification, due to the necessity of keeping the observations in memory at run time. [13] [22] Two of the key issues that influence the performance are the choice of k (the number of neighbours) and how to measure the distances between observations in the sample space. For example, if k is too low the method can be sensitive to noise, and if k is too large the neighbourhood may become too big and the classification less accurate. [13] KNN is transparent and thus easy to implement and debug. To explain the output of a classification, analysing the neighbourhood, of which everything is known, can provide a good explanation. [13] [22] KNN is very appealing due to its simplicity, but one of its major drawbacks is that it is often outperformed by more developed methods when the problem is more complex. Other drawbacks are that KNN is sensitive to irrelevant or redundant variables and, since the work is done at run time, the performance can be poor if the training set is large. [22]

Summary of algorithms

Based on the requirements stated and the algorithm overview above, Table 2 has been developed to support the decision about possible algorithms to use in the final analysis tool.
The robust column refers to the algorithm's ability to handle noise and inaccurate training data. The transparent column refers to the traceability of the algorithm. The mixed data column refers to the algorithm's ability to learn from mixed data sources. The large data column refers to the algorithm's ability to scale up in size and thus handle large datasets. The probabilistic column refers to the possibility of handling uncertainties in the data and showing them in a reasonable way to a user. Last is the adaptive column, which refers to the algorithm's ability to change and adapt its outcomes based on feedback.

Table 2: Comparison of classification algorithms

Method   Robust   Transparent   Mixed data   Large data   Probabilistic   Adaptive
DT
BN
SVM
KNN

From Table 2 one can see that Bayesian networks are the only algorithm that satisfies the stated conditions for algorithms (see 1.1.1), and thus an additional section (3.4) is devoted to exploring the area.

3.2 Evaluation method

When constructing any classifier, one always has to evaluate its uncertainty, so it is important to be able to measure this quality. Several different statistical measures can be used for this, but one of the most common is the confusion matrix. In the binary case, where the classes are limited to positive and negative, it tracks the predicted class and the real class with the following four parameters: [16]

True positives (TP): number of positive predictions that are correct
True negatives (TN): number of negative predictions that are correct
False positives (FP): number of positive predictions that are incorrect
False negatives (FN): number of negative predictions that are incorrect

The concept can be expanded to the multinomial case, but for simplicity only the binary case is considered here. The aim of any classifier is thus to maximize the true positives (TP) and true negatives (TN). The effectiveness is usually evaluated with the recall and precision measures, defined by

    Recall = TP / (TP + FN)        Precision = TP / (TP + FP)

Recall measures the ability to find all relevant entities, and precision measures the ability to retrieve only relevant entities. Although precision and recall are in theory independent, an increase in precision usually results in a reduction of the recall and vice versa. An additional evaluation tool is the F-score, which combines recall and precision [16].

Once a classifier has been constructed, it needs to be evaluated using these evaluation measures. There are several different validation techniques, such as simple split, bootstrapping, the holdout method and k-fold cross-validation. The simple split method divides the data set at hand and trains the classifier on one half and validates on the other. The holdout method randomly picks out two mutually exclusive groups, 2/3 of the original data for training and 1/3 for validation. The bootstrapping method consists of selecting samples from the original dataset with replacement to form a new data set; the algorithm is then trained on the new set, and since some of the observations can be selected more than once, there will be others that are never picked and thus can be used for validation. The k-fold cross-validation method randomly partitions the data into k equally sized sets. One of the parts is then used for validation and the remaining for training. The process is repeated k times so that each part is used once for testing, and an average of the evaluation measures can be calculated and used in the evaluation.

3.3 Probabilistic reasoning

Uncertainty is always a problem when working with classification/diagnostics, and it can enter the problem domain in several different ways. The data can be uncertain, incomplete and/or discretized and thus only approximately right. Probabilistic reasoning provides a structure for processing uncertainties in the relationships between the variables of a probabilistic model. It gives consistency to the derived results and provides an appropriate instrument for presenting uncertainty in the results. [14] Probabilistic languages consist of four basic concepts: likelihood, relevance, conditioning and causation.
[23] Likelihood measures how probable an event is to happen, and an event is conditional on another event if its likelihood changes when knowledge about the other is taken into consideration. A set of events is said to be relevant if adding knowledge about one event, on top of information regarding other parts of the set, changes the likelihood of the rest of the set. For example, if events A and B are causes of C and we have knowledge about B and add knowledge about C, this influences the likelihood of A, i.e. sets of events become relevant if common consequences are observed. [14] Causation explains the dependencies between events; for example, a set of events becomes independent if knowledge about their causes is provided.

3.4 Bayesian Networks

A Bayesian network is a directed acyclic graph (DAG) developed for reasoning under uncertainty, and it is usually represented by two parts, D and P, that together describe the joint probability distribution for a set of random variables U. D represents the variables as nodes and the dependencies between them as directed edges/arcs. [1] The edges are usually associated with directed causal influences⁴, and connected nodes are then said to have so-called parent/child relations. For example, if nodes A and B are connected and the edge goes from A to B, then A is called a parent of B and B is called a child of A. [15] P represents the conditional probability distributions, one specified for each variable given its parents, {P(X1 | pa(X1)), ..., P(Xn | pa(Xn))}, where pa(Xi) are the parents of Xi. With the Bayesian chain rule, P uniquely defines the probability distribution for the set of variables in U [15]:

    P(X1, ..., Xn) = Π_i P(Xi | pa(Xi))

Each variable/node can have discrete or continuous values, but in the discrete case the states need to be finite in number, mutually exclusive and exhaustive. [1] Beliefs about discrete variables are represented in conditional probability tables, so-called CPTs, and for continuous variables in conditional probability density functions, so-called pdfs. The tables contain all possible combinations of the states of a variable and its parents. Although both types of variables are allowed, a continuous variable needs to be assignable a linear conditional Gaussian distribution and must not have discrete children. If the two conditions are not met, the variable needs to be discretised. [15] The discretisation can introduce new problems, but several different methods have been developed to handle the transformation of data, most however with their own assumptions that then also need to be fulfilled.
[24]

⁴ Causal connections: connections that make information able to propagate from one side of the graph to the opposite side without any direct connection. [24]
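The way D and P together define a joint distribution can be exercised on a minimal two-parent network; the structure and all probability values below are invented for illustration, not taken from the thesis.

```python
# Tiny BN: Rain -> WetGrass <- Sprinkler (structure and numbers are invented).
p_rain = {True: 0.2, False: 0.8}            # P(Rain), a root node
p_sprinkler = {True: 0.1, False: 0.9}       # P(Sprinkler), a root node
# CPT: P(WetGrass = True | Rain, Sprinkler)
p_wet_given = {(True, True): 0.99, (True, False): 0.9,
               (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) via the Bayesian chain rule:
    P(R, S, W) = P(R) * P(S) * P(W | R, S)."""
    p_wet = p_wet_given[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_wet if wet else 1 - p_wet)

# The eight joint entries must sum to 1, as required of a distribution.
total = sum(joint(r, s, w) for r in (True, False)
            for s in (True, False) for w in (True, False))
assert abs(total - 1.0) < 1e-12
```

The point of the factorization is storage: the three small tables above replace a full joint table, and the saving grows quickly with the number of variables.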

To build a BN, both D and P are required to be specified. In the building process, D is regarded as the qualitative part and is recommended to be set up by a domain expert, while P is the quantitative part and can usually be calculated from historical data, but can also be specified by experts. [17] The recommendation to let domain experts specify D is due to the fact that learning the structure of a network has been proven to be NP-hard. [24] To give a simple example illustrating the issue: if the number of nodes in a network is 10, the number of possible different DAGs is approximately 4.2 × 10^18. Despite the complexity, several algorithms have been developed to learn the structure of a BN. The various techniques are usually divided into three subgroups based on their methods: score-and-search-based approaches, constraint-based approaches (which use conditional independencies found in the data) and dynamic programming approaches. [24] A more thorough investigation of the area is out of the scope of this thesis, but interested readers can get an overview of the area in the article [24].

BNs can be used to understand how a change in one node can affect another node, or even a node further away. For example, if one knows it is raining outside, the likelihood that the grass is wet changes. There are a few different ways nodes can be connected. [1]

Serial connection: If node A changes the likelihood of B and B changes the likelihood of C, then knowledge about A will influence C and vice versa. But if knowledge about B is provided, the connection is broken. (Figure 3)

Figure 3: Serial connection

Diverging connection: If A is a parent of two or more children, likelihoods can change in all the descendants except if knowledge about A is known. The children are then said to be d-separated given A. (Figure 4)

Figure 4: Diverging connection

Convergence connection: If A is a child of two or more parents and A is unknown, the parents are independent of each other, i.e. any knowledge about one parent does not give any new information relevant for the other parents. But if A is known, information about one parent may tell something about another one. (Figure 5)

Figure 5: Convergence connection

To sum up the different connections and their properties:

Information can be carried in a serial connection except if the state of a node in the connection is known, which breaks the chain.

Information can be carried in a diverging connection as long as there is no information about the specified parent node.

Information can be carried in a convergence connection if the node or any of its descendants is known.

Thus, two nodes are conditionally independent if all intermediate paths between them either have at least one known state in a serial or diverging connection, breaking the chain, or have at least one unknown state in a convergence connection, breaking the chain.

Bayesian networks can have different interpretations depending on the context they are used in and the experience the developers have. The basic concept of a Bayesian network is a factorization of a joint probability distribution that makes storage and inference easier. In a Bayesian network, inference generally refers to finding the values of a given set of variables that best explain why a set of other variables is set to certain values. [24] Thus, to use a BN for classification/inference, the concept of evidence, e, is introduced. The evidence is the assignment of findings, setting values to a subset of the variables in the network. [1] With the evidence it is possible to calculate the posterior probability of any unobserved variable not specified by the evidence; hence the likelihood of all possible causes can be found based on the known state of the system. The likelihood can be calculated by Bayes' rule⁵

    P(H | e) = P(e | H) · P(H) / P(e)

where P(e | H) is the probability of the state described in the evidence occurring if the cause H is known, P(H) is the prior probability of the cause, and P(e) is the prior probability of the evidence if nothing else is known about the system. Consider the following example to illustrate the information regarding Bayesian networks given previously in this section.

Figure 6: Made-up example with inspiration from [25]. The same image can be found in the thesis Decision Support Systems: Workflows by Daniel Schiess

The graph in Figure 6 illustrates a system for troubleshooting one section of a router with regard to a valid IP address. The Bayesian network domain is set up to consist of three variables: whether the computer uses automatic IP querying, whether the computer has the correct IP settings, and whether the DHCP server is working. The CPTs for the Bayesian network are shown in Table 3, which in this case were entered by a user but could just as easily have been calculated from historical data.

Table 3: CPTs for the made-up example illustrated in Figure 6

Automatic IP
  Yes
  No

Correct IP Settings
  Auto IP:   Yes    No
  Yes
  No

DHCP delegation problem
  Yes    0.02
  No     0.98

Valid IP Address
  Correct:   Yes           No
  DHCP:      Yes    No     Yes    No
  Yes
  No

The example network represents a joint probability table via the Bayesian chain rule. Assume now that one has the evidence that the correct IP settings are used, and ask the question: What is the probability of a valid IP address if we know that the correct IP settings are used? (no matter whether they are received automatically or set manually), i.e. P(Valid IP | Correct IP = yes). From the properties of serial connections described earlier in this section, one knows that this evidence cuts the connection of influence from the Automatic IP variable. The calculation of the probability of a valid IP address thus reduces to [17]

    P(Valid IP | Correct IP = yes) = Σ_DHCP P(Valid IP | Correct IP = yes, DHCP) · P(DHCP)
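A query of this kind can be carried out by brute-force enumeration. The sketch below redoes it in Python for a network shaped like the example; since Table 3 does not reproduce every probability, the numbers are illustrative stand-ins rather than the thesis's values.

```python
# Enumeration query on an IP-troubleshooting network of the same shape as
# the example. All probability values below are illustrative stand-ins.
p_dhcp_ok = {True: 0.98, False: 0.02}               # P(no DHCP delegation problem)
p_valid = {(True, True): 0.99, (True, False): 0.1,  # P(Valid | Correct, DHCP ok)
           (False, True): 0.0, (False, False): 0.0}

def posterior_valid(correct):
    """P(Valid IP = yes | Correct IP Settings = correct).

    With Correct IP observed, the serial path from Automatic IP is blocked,
    so only the (marginally independent) DHCP node has to be summed out.
    """
    return sum(p_valid[(correct, dhcp)] * p_dhcp_ok[dhcp]
               for dhcp in (True, False))

assert abs(posterior_valid(True) - 0.9722) < 1e-12   # 0.99*0.98 + 0.1*0.02
```

Note how the blocked serial connection shows up in the code: Automatic IP does not appear in the sum at all once Correct IP is part of the evidence.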

Naïve Bayesian Network

A naïve Bayesian classifier (NBC) is a special case of BNs that uses a strong assumption of independence among the variables, usually referred to as naïve Bayes. In a real-world scenario the variables of a system seldom fulfil this assumption; regardless, this surprisingly simple setup has been proven to perform very well and can be competitive with state-of-the-art classifiers such as C4.5. [26] One of the reasons for this is that classifiers are only interested in the class with maximal probability and not in the exact probability distribution. [15] Under the independence assumption, the calculation of the conditional probability distribution for the desired variable given the evidence is very easy. In a naïve Bayes classifier each variable generally has only the class variable as its parent, which means that the structure is fixed and the only work involved is learning the probabilities. Due to its flexibility and ease of learning and use, the NBC is widely used. [15] The major drawback of NBCs is their inability to identify rare classes/events that are identified through a set of values appearing together, where each value on its own does not indicate the specific case. NBCs cannot comprehend these situations due to the independence assumption given the class, and hence one may want to extend the NBC to a more elaborate dependency structure among the variables. [15] One such addition to the classical method is the tree augmented naïve Bayesian classifier (TAN), which tries to find a structure that has maximal likelihood; a thorough explanation can be found in [15]. The general procedure of an NBC can be summarized in the following example [15], illustrated in Figure 7.

Let all possible causes be collected in the variable C with the prior probabilities P(C).

Obtain, for all information variables Xi, the conditional probability distribution P(Xi | C).
For any observation e = {x1, ..., xn} of the information variables, the likelihood can be calculated by P(e | C) = Π_i P(xi | C).

The posterior probability can then be calculated from Bayes' rule, P(C | e) = α · P(C) · Π_i P(xi | C), where α is the normalization constant.
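The steps above can be sketched directly in Python. This is a minimal illustration under the naïve independence assumption; the troubleshooting causes, variable names and all probabilities are invented.

```python
from math import prod

# Naive Bayes posterior: P(c | e) proportional to P(c) * prod_i P(x_i | c).
# Toy troubleshooting domain; causes and CPTs are invented for illustration.
prior = {"hw_fault": 0.3, "sw_fault": 0.7}
cpd = {  # P(observation value | cause), one table per information variable
    "alarm":  {"hw_fault": {"yes": 0.9, "no": 0.1},
               "sw_fault": {"yes": 0.4, "no": 0.6}},
    "reboot": {"hw_fault": {"yes": 0.2, "no": 0.8},
               "sw_fault": {"yes": 0.7, "no": 0.3}},
}

def posterior(evidence):
    """Normalized P(cause | evidence) for fully observed information variables."""
    scores = {c: prior[c] * prod(cpd[var][c][val] for var, val in evidence.items())
              for c in prior}
    alpha = 1 / sum(scores.values())    # the normalization constant
    return {c: alpha * s for c, s in scores.items()}

post = posterior({"alarm": "yes", "reboot": "no"})
assert abs(sum(post.values()) - 1.0) < 1e-12
assert post["hw_fault"] > post["sw_fault"]
```

Because only the ranking of causes matters for a recommendation, the unnormalized scores would already suffice; the normalization is needed only when the probabilities themselves are shown to the user.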

Figure 7: Naïve Bayes model

Once again, consider the IP address example from the main section on Bayesian networks (3.4), but in this case build a naïve Bayesian network.

Figure 8: Made-up example with inspiration from [25]. The same image can be found in the thesis Decision Support Systems: Workflows by Daniel Schiess

Figure 8 illustrates the same made-up example, but this time with a naïve approach, which in turn changes the associated CPTs.

Table 4: CPTs for the made-up example illustrated in Figure 8

Automatic IP
  Yes
  No

Correct IP Settings        DHCP delegation problem
  Yes    0.73                Yes    0.02
  No     0.27                No     0.98

Valid IP Address
  Correct:   Yes                        No
  DHCP:      Yes         No             Yes         No
  Auto:      Yes   No    Yes   No       Yes   No    Yes   No
  Yes
  No

The example network represents a joint probability table via the Bayesian chain rule.

Assume once again that one has the evidence that the correct IP settings are used, and ask the question: What is the probability of a valid IP address if we know that the correct IP settings are used? The calculation of the probability of a valid IP address, P(Valid IP | Correct IP = yes), then follows from the CPTs in Table 4. [17]

3.5 Explanation methods and properties

The following section goes through the concept of explaining the outcomes of a Bayesian network. The section starts with a short introduction to what an explanation is, 3.5.1. It continues by listing important perspectives and properties of an explanation, 3.5.2. Last, some explanation methods are listed and defined, 3.5.3.

What is an Explanation

Explanations of how a system reasons are one of the key factors in implementing a successful expert system. An explanation is usually regarded as expressing something in a way that is understandable for the receiver. The explanation also aims at giving the receiver satisfactory coverage of the object at hand. [27] Thus, before developing an explanation the following aspects need to be considered: what to explain, with whom the system is interacting, and how the system should interact. In the case of BNs these basic concepts can be converted to explaining the knowledge base, the reasoning process and the conclusions. These three aspects can further be described by: [27]

Explanation of evidence: Involves identifying the most probable explanation matching the evidence. The process is carried out by calculating the posterior probabilities for all unobserved variables, which enables the system to find influential variables, based on some criterion specified in the algorithm, that can explain the evidence. This process is usually called abduction and can roughly be interpreted as how certain the system is of its classification.

Explanation of the reasoning process: Consists of explaining how the system arrived at its classification. This gives the user not only a tool for interpreting the results but also for debugging and developing a system. As an evaluation tool it can be used to check that all steps on the way to a solution are correct. In the case of debugging an expert system, the tool can be helpful for finding errors in a knowledge base. In the development stage, explanations are often a necessity to convince the user, who is usually an expert of the domain, that the system has produced a correct result.

Explanation of the model: Consists of showing the user, by words, graphs or in any other way, the information on which the classification is built, i.e. parts of the knowledge base. The objective of this part is to help the constructor of a new system during development, but the explanation can also give a beginner some intuition about the domain.

Important Perspectives of an Explanation

There are several different aspects of how to develop an explanation method. To begin with, explanations can be divided into two groups based on their purposes: descriptive and comprehensive. [27] The descriptive methods aim at showing the underlying knowledge and provide more details and information regarding the classification. The comprehensive methods, on the other hand, try to give the user an understanding of the model and its implications, e.g. how each finding in the evidence affects the final outcome. Another crucial aspect is how thorough an explanation needs to be and what perspective it should have. Two perspective levels usually mentioned in the literature are the micro and macro levels. An explanation on the micro level aims to give the user a detailed description of how one node varies when its surrounding nodes are changed.
A macro level explanation, on the other hand, aims at describing the rough outline of the reasoning and takes the user through the most important steps from evidence to classification. [27] But these perspectives do not answer the questions of when to present an explanation and how thorough it should be. Since one of the basic ideas of an explanation, as mentioned earlier, is to give the receiver satisfactory coverage of the object at hand, the system should be dynamic and adapt the amount of information to the receiver's knowledge. The amount of detail and the overall quantity should vary depending on the receiver's knowledge of the domain and the reasoning method. [27] This gives rise to the question of how to find the level of experience the user has of the system. A thorough investigation of the area is out of the scope of this thesis, although several sources mention ranking systems, for example as apprentice, journeyman and master, or giving the user the possibility to interact and retrieve as much information as desired.

There are several different ways to communicate information gathered from the system, and successful expert systems have been known to use a diverse set of communication methods. The first step when developing a presentation is to decide if it should be in verbal, graphical or multimedia form. [27] Verbal form includes using text and/or numbers, while graphical form includes diagrams, plots, workflows and/or presenting the BN layout. A multimedia presentation includes both the verbal and the graphical information together with images, videos and sound in a well-fitted multimedia environment. Since a large part of explaining the outcome of a BN consists of presenting probabilities, the natural next step is to decide how to present them. Probabilities can be expressed in numerical/quantitative or linguistic/qualitative form. [27] In numerical or quantitative form the probabilities can for example be presented as 0.54 or 54%, or perhaps even as odds between two parties. In contrast, the linguistic or qualitative form could use expressions such as often, seldom, likely and almost sure to explain the outcome.

Explanation methods

Several methods have been developed to explain the outcome of a Bayesian network. As mentioned earlier, the different methods aim at explaining the evidence, the reasoning process and/or the model used. Generally, if a domain has multiple target variables it is often more appropriate to present explanations for the given evidence, so-called abduction. [28] Another suitable method group that is frequently used to present the outcome is the tree-based methods. One of the most commonly used is the explanation tree (ET) method, which aims at finding the best explanation of the observed variables and then illustrates it in a tree structure. [29]

Abduction methods

When using abduction methods, the best explanation for the given evidence is regarded to be the configuration of the system that is most probable given the evidence. Thus, the aim of an abduction method in the context of BNs is, given some evidence, to find a configuration of the network that has the maximum posterior probability, i.e. to find the best configuration of the explanation variables so that it also is consistent with the cause.
[27] The process of finding a possible state of the system can be pictured as follows: if we observe an effect e, and have the rule that a cause c can give rise to the system state e, then one can assume that c is a probable hypothesis/explanation for the existence of e. [30] A drawback is that there often are several different good explanations, so it is necessary to choose the best one among them. Thus, the way to find the best explanation in an abduction algorithm is divided into two steps: generation of explanations and selection among them, Figure 9. [31]

Figure 9: Illustration of the process of finding the best explanation

To make the necessary selection, the two following criteria are often used: metric-based criteria and simplicity criteria. The metric-based criteria take into consideration aspects such as weights, probabilities etc., and the simplicity criteria state that the best explanation is often the simplest one. [31] The literature usually mentions two groups of abduction methods: total abduction, often referred to as most probable explanation (MPE), and partial abduction, often referred to as maximum a posteriori assignment (MAP).

Most probable explanation includes all the unobserved variables U in the explanation. The best explanation is then the assignment u of U that has the maximum posterior probability P(u | e) given the evidence e [29]. However, the algorithm often yields several competing explanations, so the next step is to limit their number. This problem has been studied by several scientists and numerous algorithms have been developed. One approach is to limit the number to the K best MPEs, and it has been proven that the K MPEs can be found, under certain assumptions, with a computational cost similar to that of the MPE. Unfortunately, there are some common drawbacks with this algorithm. The explanations can be too specific and may contain unimportant variables, and it can also be difficult to distinguish between different explanations, which are often long and similar to each other. [17] [29] [31]

Maximum a posteriori assignment tries to address the problems associated with the MPE method by considering the target variables as a subset X of the total unobserved variables, the so-called explanation set. Generally there are some variables that clearly have no explanatory value, while others may be intermediate nodes or simply do not contain any information about the cause being explained. Thus we look for the maximum a posteriori assignment of these variables given the evidence,

    MAP(X, e) = argmax_x Σ_y P(x, y | e)

where Y is the set of variables that contains all elements in U that are not in X, i.e. Y = U \ X. This method is generally less efficient than the MPE, and uncertainties are introduced into the system, making it more complex. [28] [31] [29]

Now the question arises of which variables should be used in the explanation. There are several different methods to approach this problem, and many systems avoid the question by assuming that the explanation set is provided by a domain expert. Others have chosen to include only the ancestors of the cause in the explanation. However, the problem cannot be handled with these rather simple suggestions if the network is large and complex. Some of the better-known and more advanced models are scenario-based explanation and Occam's razor. In the scenario-based method, a tree of propositions is developed where the path from root to leaf illustrates a specific scenario, and the scenario with the highest probability is the target. It is possible to have partial explanations, but then they need to come from a restricted number of predefined examples. In the Occam's razor method, conciseness is the main focus, i.e. the user should not be presented with unnecessary detail, only the most influential variables of the complete explanation. [17] [28] [29]

Tree based methods

Tree-based explanation methods all aim at finding the best explanation of the observed variables and representing it in a so-called tree structure. Every inner node in the tree represents one of the variables in the explanation set. From the node, branches extend towards the next node in line, and each branch represents a possible state of the specific explanation variable. The evidence determines the assignment of the nodes and hence gives the path from root to cause. The basic idea is simple, but how to obtain the tree is not trivial. There are two tasks in the generation process [29]:

How to select the next variable in line?

How to decide when to stop expanding?
Several methods have been developed to answer these questions. One of those methods is called explanation trees (ET), which greedily branches on the most explanatory variables while keeping the probability of each branch above a certain threshold. The method uses two types of information measures, mutual information (I) and the GINI index (GINI), to evaluate the questions mentioned above.
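As a concrete illustration of the two measures, they can be computed for a candidate variable X against the cause C from a joint distribution (the toy numbers below are made up):

```python
import math

def mutual_information(joint):
    # joint: {(x, c): P(x, c)}; I(X; C) = sum P(x,c) log2(P(x,c)/(P(x)P(c)))
    px, pc = {}, {}
    for (x, c), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        pc[c] = pc.get(c, 0.0) + p
    return sum(p * math.log2(p / (px[x] * pc[c]))
               for (x, c), p in joint.items() if p > 0)

def gini_index(joint):
    # GINI(C|X) = sum_x P(x) * (1 - sum_c P(c|x)^2); lower = purer branches
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return sum(px[x] * (1 - sum((p / px[x]) ** 2
                                for (x2, _), p in joint.items() if x2 == x))
               for x in px)

# X is strongly associated with the cause here, so I is high and GINI is low:
j = {(0, "a"): 0.4, (0, "b"): 0.1, (1, "a"): 0.1, (1, "b"): 0.4}
print(mutual_information(j), gini_index(j))
```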

For a candidate variable X and the cause variable C the two measures can be written as

I(X; C) = Σ_x Σ_c P(x, c) log( P(x, c) / (P(x) P(c)) )

GINI(C | X) = Σ_x P(x) ( 1 − Σ_c P(c | x)² )

The resulting algorithm will have somewhat different properties depending on the choice of evaluation measure. The measures are compared with user-specified thresholds to decide the next step. The method picks new nodes in order of how informative they are about the remaining variables, not about the cause. [28] [29] [32] The drawback with this method is mainly related to the choice of expandable variable. Choosing variables based on the information gain of the remaining variables does not necessarily reduce the uncertainty about the cause. The ranking system used to choose the next variable in line is also known to be sensitive to the chosen thresholds and thus to human error. [28] [32] The method has been further developed into the so-called causal information flow, where causal dependencies are taken into account when selecting variables. Even though this method performs better than ET and is good at identifying important individual variables, it inherits drawbacks from the previous method: it is still subject to user errors and may fail to identify a combined effect of multiple variables. A description of the explanation algorithm can be found in section

Feedback

When the decision support system is up and running, its performance may be sufficient, or the system may need to adapt itself based on feedback from the users. There are several possible feedback paths for a Bayesian network. The most obvious one is to gather new training examples during run time into batches and relearn all the parameters after a set amount of time or quantity of collected data, so-called offline learning. This obviously requires storage capacity and may raise issues regarding the accessibility of the system during the training phase. [33] Another way to include feedback is to use so-called online learning, which learns the parameters from data as it is generated.
This method is designed to handle situations where the environment changes or when the model needs fine-tuning. [34] Several algorithms have been developed in this area; some of the more popular ones are the framework described in [34] and the voting EM described in [33].

In the first example mentioned, the framework provides a method usable both for online learning and for the more classical batch learning, which is a type of offline learning. For the batch case the method includes both the gradient projection algorithm and the EM algorithm 6. For the online learning case it enables a parameterized version of EM. The basic idea is, given a BN described by a parameter vector θ and a data set D, to create a new network described by θ'. Two factors affect the outcome θ': how well it fits the data, and the fact that the change between θ and θ' should be small. These factors are included when θ' is found from the following equation [33] [34]

θ' = arg max over φ of [ η L_D(φ) − d(φ, θ) ]

where L_D is the normalized log likelihood, d is the distance between the two models and η is the learning rate. [33] This method leaves open the question of how to weight the additional data. This problem is addressed in the second example, the voting EM, which essentially is a modified version of the framework example described above; the definition of the parameter update is changed into [33]

θ_ijk(t) = (1 − η) θ_ijk(t−1) + η P(x_i^k | pa_i^j, e_t, θ(t−1))   if P(pa_i^j | e_t, θ(t−1)) > 0
θ_ijk(t) = θ_ijk(t−1)   otherwise

where P(pa_i^j | e_t, θ(t−1)) is the estimated probability of the parents given the evidence and the previously estimated network, and η is the learning rate (the past is weighted less). This method has been proven to adapt to changes and to be able to escape local maxima. Furthermore the method can handle both complete and missing data, and it converges quickly towards the true CPT parameters. [33]

6 Expectation-maximization algorithm: iterative method to find the maximum likelihood by alternating between an expectation step, evaluating the log-likelihood using the current parameters, and a maximization step, finding the parameters that maximise the expected log-likelihood found in the expectation step. [35]
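A rough sketch of an online update of this kind (the per-row convex-combination form below is an assumption for illustration, paraphrasing the description above rather than reproducing the exact rule from [33]):

```python
def online_update(theta_row, posterior, eta=0.1):
    # theta_row: current CPT row {state: probability} for one parent configuration
    # posterior: estimated P(state | parents, evidence) under the previous model
    # A convex combination keeps the row a valid probability distribution.
    return {s: (1 - eta) * theta_row[s] + eta * posterior[s] for s in theta_row}

row = {"ok": 0.9, "faulty": 0.1}
evidence_posterior = {"ok": 0.2, "faulty": 0.8}  # new evidence points to a fault
row = online_update(row, evidence_posterior)
print(row)  # the row drifts towards the posterior
```

The learning rate eta plays the same role as above: a small eta weights the past heavily, a large eta adapts quickly to a changing environment.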

4 Method

To achieve the objectives stated in the problem section, two methods have been identified. The first is a literature study to choose appropriate algorithm(s) and investigate how to explain outcomes from the analysing tool. Second, experiments are conducted in order to understand the challenges in applying telecom network data to the methods.

4.1 Literature analysis

To get an introduction to recent research in the area and investigate state-of-the-art solutions, a literature study is conducted. The focus of the study is to find a proper tool for developing a classification system intended to give decision support. The study also focuses on investigating different explanation methods, their properties and feedback possibilities. Bayesian networks are pointed out early as a possible candidate for the future work, and a more thorough investigation of their properties and explanation methods is conducted. All findings from the literature analysis are collected in the theory section. Information regarding the area has primarily been collected from Springer Link, CiteSeer, Google Scholar and from the library at KTH. Information has been found mainly based on searches for the following keywords: Bayesian networks, Explanation methods for Bayesian networks, Classification algorithms and Probabilistic Graph theory. Another part of the literature study was made via the online courses Machine Learning and Probabilistic Graphical Models from the online course website Coursera. The most recent related paper regarding this thesis was published in the Artificial Intelligence Research Journal in December 2012 and handles diagnosing root causes of failure from trouble tickets in a telecommunication network with the help of Bayesian networks.

4.2 Experimentation

During experimentation, a recurring part of the thesis, ideas and concepts were implemented. The implementations were often short and were performed mainly to further understand the specific system being investigated. In the experiments mainly three programs have been used: R, Activiti and Matlab. Since no satisfying explanation method was found during the literature study, a newly developed explanation method is defined in section 5.4.

5 Experimentation

This section describes all the experiments conducted during the work. It starts with basic testing of the accuracy of different classification algorithms and shows one of the more promising ones (Bayesian networks), 5.1. Second is an experiment testing the capability of binding a workflow engine to a mathematical platform, enabling a dynamic way of enhancing the workflow components, 5.2. The following section tests a Matlab implementation of an NBC model for further testing, 5.3. The next part describes a new way to explain the outcome from a BN, 5.4. Last is an implementation of the new explanation method, 5.5.

5.1 Evaluation of classification algorithms

To get a basic understanding of the different classification algorithms, some initial experiments were conducted with well-known benchmark datasets. The first one is a wine quality dataset that includes 13 different continuous properties with 178 observations from 3 different quality levels. The second one is a flower species dataset that includes 4 continuous properties with 150 observations from 3 different flower classes. The third and last dataset is based on the survivability of the passengers from the Titanic incident. The set includes 3 discrete variables for 2201 passengers and lists whether each one survived or not. The datasets can be found from numerous online sources, for example on the website of the Machine Learning Repository 7. Many basic computer experiments were made at this early stage, e.g. classification with naïve Bayesian networks on the Titanic survival data. This was done with the help of the R library bnlearn, available for free download for the newest R software (3.0.1). The experiment gave the following results in the measurement scores described in section 3.2. The Titanic survival data gave the confusion matrix below when using 1/3 of the data to cross-validate the algorithm.

Table 5: Confusion matrix from the naïve Bayesian classifier with the Titanic data, 2/3 used as training data and 1/3 as validation data.

                       True
Predicted      Survived      Dead
Survived
Dead                  8        97

From the table we can see that the classification algorithm makes few errors with this specific dataset. The recall, precision and F-score can then be calculated. The values verify that the NBC algorithm can work as a fast and easy implementation on basic discrete data.

5.2 Connecting R with Activiti

To get a prototype analysing tool up and running, we needed to connect a calculation platform to the workflow engine used in the NOCs today. Thus we created a simple example with the Activiti workflow engine. Activiti was deployed on a Tomcat web server and connected to the R server via a Java class, using Activiti's platform for writing customised workflow components. In Eclipse, which is a Java IDE, code was written to connect the R server to the workflow engine, and the example shown in Figure 10 was created. The example is about troubleshooting an IPTV network with four hardware components: Broadband Remote Access Server (BRAS), Digital Subscriber Line Access Multiplexer (DSLAM), set-top box (STB) and a root gateway (RGW). All the components have their internal settings, and we mocked a total of 14 different ones. This was mainly done to verify the connection establishment and the communication between components, and to visualise the working process. Thus no real data was used and no heavy calculations were carried out in the process. A guide on how we set up the connections can be found in 8.2. When the workflow is executed, the request is made via the Eclipse script and

connects to the R server, which then executes the specified R script to make the recommendation, and the following answer is received:

Probability of error at:
STB: 0.66
BRAS: 0.15
DSLAM: 0.03
GW: 0.16

The most probable error is in: STB

Based on previous data, the most influencing variables for making this conclusion are abnormal values on: STB FW edition, STB CPU load and whether the STB is connected over WIFI.

Figure 10: Workflow set up in Activiti with mocked output data from R servers, to give an intuition of how the workflow may look.

This shows us that an Activiti-based workflow engine can be connected with R as a calculation platform and can thus utilise all its available packages, which gives a dynamic possibility of creating new components.
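The recommendation step behind the mocked output above can be sketched as a simple ranking of the per-component error probabilities (the numbers are taken from the example run; the real system computes them with the Bayesian model):

```python
probs = {"STB": 0.66, "BRAS": 0.15, "DSLAM": 0.03, "GW": 0.16}

def recommend(component_probs):
    # Rank components by estimated error probability, highest first,
    # and report the top candidate to the troubleshooter.
    ranked = sorted(component_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked

top, ranked = recommend(probs)
print("The most probable error is in:", top)
```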


More information

NXP Basestation Site Scanning proposal with AISG modems

NXP Basestation Site Scanning proposal with AISG modems NXP Basestation Site Scanning proposal with modems Advanced Systems White Paper by Jaijith Radhakrishnan There are a number of connectivity issues associated with cellular base stations that can increase

More information

Using Adaptive Random Trees (ART) for optimal scorecard segmentation

Using Adaptive Random Trees (ART) for optimal scorecard segmentation A FAIR ISAAC WHITE PAPER Using Adaptive Random Trees (ART) for optimal scorecard segmentation By Chris Ralph Analytic Science Director April 2006 Summary Segmented systems of models are widely recognized

More information

Using WebLOAD to Monitor Your Production Environment

Using WebLOAD to Monitor Your Production Environment Using WebLOAD to Monitor Your Production Environment Your pre launch performance test scripts can be reused for post launch monitoring to verify application performance. This reuse can save time, money

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

A semi-supervised Spam mail detector

A semi-supervised Spam mail detector A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

More information

Content-Based Recommendation

Content-Based Recommendation Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

More information

Grow Revenues and Reduce Risk with Powerful Analytics Software

Grow Revenues and Reduce Risk with Powerful Analytics Software Grow Revenues and Reduce Risk with Powerful Analytics Software Overview Gaining knowledge through data selection, data exploration, model creation and predictive action is the key to increasing revenues,

More information

(Refer Slide Time: 2:03)

(Refer Slide Time: 2:03) Control Engineering Prof. Madan Gopal Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 11 Models of Industrial Control Devices and Systems (Contd.) Last time we were

More information

Drive Down IT Operations Cost with Multi-Level Automation

Drive Down IT Operations Cost with Multi-Level Automation White White Paper Paper Drive Down IT Operations Cost with Multi-Level Automation Overview Reducing IT infrastructure and operations (I+O) budgets is as much on the mind of CIOs today as it s ever been.

More information

Using HART with asset management systems

Using HART with asset management systems Using HART with asset management systems Since it s the most broadly deployed smart device platform, is HART the right choice for your plant? Here are some considerations for end users. John Yingst, Sr.

More information

HELP DESK SYSTEMS. Using CaseBased Reasoning

HELP DESK SYSTEMS. Using CaseBased Reasoning HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is Help-Desk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Neural Networks and Back Propagation Algorithm

Neural Networks and Back Propagation Algorithm Neural Networks and Back Propagation Algorithm Mirza Cilimkovic Institute of Technology Blanchardstown Blanchardstown Road North Dublin 15 Ireland mirzac@gmail.com Abstract Neural Networks (NN) are important

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

More information

Enabling Continuous Delivery by Leveraging the Deployment Pipeline

Enabling Continuous Delivery by Leveraging the Deployment Pipeline Enabling Continuous Delivery by Leveraging the Deployment Pipeline Jason Carter Principal (972) 689-6402 Jason.carter@parivedasolutions.com Pariveda Solutions, Inc. Dallas,TX Table of Contents Matching

More information

Test Data Management Best Practice

Test Data Management Best Practice Test Data Management Best Practice, Inc. 5210 Belfort Parkway, Suite 400 Author: Stephanie Chace Quality Practice Lead srchace@meridiantechnologies.net, Inc. 2011 www.meridiantechnologies.net Table of

More information

8000 Intelligent Network Manager

8000 Intelligent Network Manager SOLUTION BRIEF 8000 Intelligent Network Manager Improve Profitability and Competitiveness with Operational Efficiency The Coriant 8000 Intelligent Network Manager is a powerful network and service management

More information

NEURAL NETWORKS A Comprehensive Foundation

NEURAL NETWORKS A Comprehensive Foundation NEURAL NETWORKS A Comprehensive Foundation Second Edition Simon Haykin McMaster University Hamilton, Ontario, Canada Prentice Hall Prentice Hall Upper Saddle River; New Jersey 07458 Preface xii Acknowledgments

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

Managed Services. New Vibrations. Contents. Key Articles

Managed Services. New Vibrations. Contents. Key Articles Managed Services New Vibrations New Vibrations is a global staffing and services company dedicated to the Telecommunications, IT and Rail sectors. Founded in Portugal in early 2005, New Vibrations quickly

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Modeling Guidelines Manual

Modeling Guidelines Manual Modeling Guidelines Manual [Insert company name here] July 2014 Author: John Doe john.doe@johnydoe.com Page 1 of 22 Table of Contents 1. Introduction... 3 2. Business Process Management (BPM)... 4 2.1.

More information

Copyright www.agileload.com 1

Copyright www.agileload.com 1 Copyright www.agileload.com 1 INTRODUCTION Performance testing is a complex activity where dozens of factors contribute to its success and effective usage of all those factors is necessary to get the accurate

More information

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications IBM Software Information Management Scaling strategies for mission-critical discovery and navigation applications Scaling strategies for mission-critical discovery and navigation applications Contents

More information

Case Study of A Telecom Infrastructure Management Company

Case Study of A Telecom Infrastructure Management Company Case Study of A Telecom Infrastructure Management Company Customer : A Leading Telecom Tower Management Company in India Customer s Business Serves to Telecom Operators Provides Network Operations Services

More information

Work Smarter, Not Harder: Leveraging IT Analytics to Simplify Operations and Improve the Customer Experience

Work Smarter, Not Harder: Leveraging IT Analytics to Simplify Operations and Improve the Customer Experience Work Smarter, Not Harder: Leveraging IT Analytics to Simplify Operations and Improve the Customer Experience Data Drives IT Intelligence We live in a world driven by software and applications. And, the

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems Chapter 1 Introduction System failures have been one of the biggest obstacles in operating today s largescale computing systems. Fault localization, i.e., identifying direct or indirect causes of failures,

More information

Process Intelligence: An Exciting New Frontier for Business Intelligence

Process Intelligence: An Exciting New Frontier for Business Intelligence February/2014 Process Intelligence: An Exciting New Frontier for Business Intelligence Claudia Imhoff, Ph.D. Sponsored by Altosoft, A Kofax Company Table of Contents Introduction... 1 Use Cases... 2 Business

More information

Guideline for stresstest Page 1 of 6. Stress test

Guideline for stresstest Page 1 of 6. Stress test Guideline for stresstest Page 1 of 6 Stress test Objective: Show unacceptable problems with high parallel load. Crash, wrong processing, slow processing. Test Procedure: Run test cases with maximum number

More information