University of Amsterdam
MSc Stochastics and Financial Mathematics

Master Thesis

Probabilistic Risk Analysis in Business Continuity Management

Author: Yu Xu (10311009, bluefishxy@gmail.com)
Supervisor: ing. Egbert Smit, ABN AMRO Bank N.V.
Examiner: dr. Bert van Es, Universiteit van Amsterdam

August 27, 2014
To my parents and my girlfriend...
Abstract

Probabilistic Risk Analysis in Business Continuity Management

Yu Xu

This thesis explores the business structure and dependencies within a bank using Business Impact Analysis (BIA) and Business Continuity Risk Analysis (BCRA). The aim is to investigate the criticality of its components and their vulnerability to damage. In the first stage, the business structure is stored in a graph database, and we query the connections between processes of interest and the buildings or applications where damage or attacks are observed. Subsequently, the structure is translated into quantitative terms, and centralities are computed to show the properties of the network. In the second stage we build a Bayesian network, which involves a probabilistic analysis of business continuity management (BCM). The risk probabilities are estimated, and the dependencies between business components are represented by conditional probabilities. In addition, to answer probability questions given evidence, i.e. the observed states of certain components, Bayesian inference algorithms are proposed. To validate the accuracy of the input parameters, we present a sensitivity analysis that examines their interactions. An application of the probabilistic model is Value at Risk (VaR), which combines risk probability distributions and loss distributions to calculate the maximum loss that will not be exceeded at a certain confidence level.

Key words: Business Impact Analysis, Business Continuity Risk Analysis, Business Continuity Management, Bayesian network, Value at Risk
Acknowledgement

At the beginning of 2014, I was thinking about what I was going to do for my final thesis. Half a year has passed, and I am sitting in a large, bright and busy office at Foppingadreef, enjoying my last month working at ABN AMRO Bank N.V. I didn't have any working experience before this internship, but I have adapted to full-time daily working life now. I have spent an amazing time in the bank and will never forget how this precious experience stimulated and strengthened me.

My daily supervisor in the bank, Egbert Smit, who was also my interviewer, gave me a lot of help on the research project. He educated me about business continuity management and guided me throughout my work. I would like to express my gratitude to other colleagues in the BCM department: Marc van Doorenmaalen, Wim Hut, Marga de Lange, Irene Lewis, Robert Dieben and Nazima Guman. They never hesitated to help me when I asked. My thanks also go to the whole CISO MT, where I met many interesting people and learned a lot from them.

I would especially like to thank Bert van Es, my supervisor at the university, who gave me valuable lectures as well as lecture notes. I had inspiring discussions with him during our meetings, and I appreciated his opinions on and recommendations for my research. I would also like to thank Peter Spreij and Erik Winands, who gave me much guidance in looking for an internship opportunity.

Last but not least, I would like to say thank you to my parents. We had video chats every week, and they always showed concern for and encouraged me in my work, study and daily life. Thanks also go to my girlfriend; we have been in a relationship for almost four years, two of them living apart. She means a lot to me, and I must express my apologies that I couldn't accompany her when she needed me during these years. I love my family and all the friends who shared an exciting time with me in Amsterdam.

Yu Xu
Contents

Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Business Continuity Management
  1.2 Literature and Research Methods
  1.3 Thesis Structure

2 Business Impact Analysis
  2.1 Business Graph Database
    2.1.1 Business Structure in a Bank
    2.1.2 An Application of Query in Neo4j
  2.2 Business Network Analysis
    2.2.1 Basic Network Concepts
    2.2.2 Centrality
    2.2.3 An Application of Centrality in Gephi

3 BCRA: Model Description
  3.1 Graphs
  3.2 Bayesian Network
    3.2.1 Independencies
    3.2.2 Factorization
  3.3 Local Probabilistic Models
    3.3.1 Tabular CPDs
    3.3.2 Noisy-OR CPDs
  3.4 Application of a Bayesian Network in Business Structure

4 BCRA: Risk Probability Analysis
  4.1 Risk Data
    4.1.1 Natural Disaster
    4.1.2 Other Risks
  4.2 Distribution Fitting
    4.2.1 Goodness of Fit
    4.2.2 Pareto Distribution
    4.2.3 Application
  4.3 Parameter Uncertainty
    4.3.1 Bootstrap Confidence Interval
    4.3.2 Application
  4.4 Parameter Learning
    4.4.1 Dirichlet Distribution
    4.4.2 Bayesian Estimation
    4.4.3 Application

5 BCRA: Bayesian Network Analysis
  5.1 Bayesian Inference
    5.1.1 Inference Algorithms
    5.1.2 Inference Scenario Tests
  5.2 Sensitivity Analysis
    5.2.1 Functional Relationship
    5.2.2 Application

6 BCRA: Value at Risk
  6.1 Definition
  6.2 Application

7 Conclusion and Suggestions
  7.1 Conclusion
  7.2 Suggestions
    7.2.1 Bayesian Network
    7.2.2 Probability Analysis
    7.2.3 VaR

A Maximum Likelihood Estimation of α

B Sample Results of Other Risks
  B.1 Wind Speed Data
  B.2 Water Level Data
List of Figures

1.1 Bank Risk Taxonomy
1.2 Business Structure in a Bank
1.3 Phases of BIA
1.4 Phases of BCRA
2.1 Graph Elements and Properties
2.2 Overview of Nodes and Relationships of the Business Structure
2.3 Business Process Nodes with Names and IDs
2.4 Overview of Output Result in Neo4j under the Restriction of 10 Applications
2.5 Undirected and Directed Graph
2.6 Node with High Betweenness Centrality and Low Degree Centrality
2.7 Data Import from Neo4j
2.8 Grouped, Colored Network and Layout in Good Visualization
2.9 Node Properties and Metrics in Data Laboratory
2.10 Out-degree of Business Process and in-degree of Application. The smallest out-degree among Business Processes belongs to BP24, which has one process and one RTO. The largest in-degree among Applications belongs to APP1045 and APP1033, which are both used by 21 processes
2.11 Betweenness centrality of Process. The most influential one under this measure is P352
2.12 In-degree and eigenvector centrality of Building. Both metrics indicate that BU10016 has more impact on business lines than other buildings
2.13 Out-degree, betweenness and eigenvector centrality of Business Line. BL5011 has the lowest out-degree (located in only one building), and the largest betweenness and eigenvector centrality values belong to BL5017 and BL5016
3.1 An Example of a Directed Graph G
3.2 Flow of Probabilistic Influence in a Bayesian Network
3.3 Noisy-OR Model
3.4 Building Risk Model
3.5 CPD of Risk Nodes, Building Nodes and Business Process Nodes of Different Types
3.6 Application Risk Model
4.1 Netherlands earthquake magnitude data from KNMI, 1990 to 2014
4.2 Water level and flow strength reference example of Katerveer, the Netherlands
4.3 ECDF and Fitted CDF
4.4 Density of Four Fitted Distributions with the Empirical Data
4.5 ECDF and Fitted CDF with Limit
4.6 Log-log Scale Plot and Fitted Distributions
4.7 Bootstrap Replications of α
4.8 QQ-Plot of α and log(α)
4.9 Ratio Fluctuation of (α_i + x_i)/(α + x) During Learning
4.10 Random Numbers Generated from Dirichlet Distributions
4.11 Surface of Dirichlet(95, 4, 1), 1000 samples
5.1 P(Process | Hack) vs P(Hack)
5.2 P(fail P124 | yes Hack) vs P(fail P232 | fail P124)
5.3 P(fail P232 | yes Hack) vs P(fail P232 | fail P124)
5.4 (a) P(fail P124 | yes Hack) vs P(fail P121 | fail P124) (b) P(fail P124 | yes Hack) vs P(fail P236 | fail P124) (c) P(fail P124 | yes Hack) vs P(fail P234 | fail P124)
6.1 Aggregated Loss Distribution
6.2 Loss Distribution in Log-log Scale
6.3 Simulated Loss Data
6.4 Aggregated Loss Distribution: (a) Small Loss Amounts (b) Large Loss Amounts
7.1 VaR, CVaR Deviation
B.1 Netherlands Wind Speed Data and Fitted Density
B.2 Pareto Fitting with Best Fit but Too Large x_m = 17.3 (close to 20)
B.3 Pareto Fitting with Good Fit and Appropriate x_m = 10.3 (far from 20)
B.4 CDF of Arnhem and Culemborg brug
B.5 CDF of Den Helder and Deventer
B.6 CDF of Eemshaven and IJmuiden
B.7 CDF of Katerveer and Roermond boven
B.8 CDF of Rotterdam and Westkapelle
List of Tables

3.1 Summary of whether a trail is active, depending on whether W belongs to the evidence set Z
3.2 An example of the CPT of Business Process 2 in Figure 3.2
3.3 Parameter table of the Noisy-OR model when n = 3
3.4 CPT of an effect variable with 3 parents, transformed by the Noisy-OR model
3.5 Noisy-OR model of the example in Figure 3.2
3.6 CPT including leak state in the Noisy-OR model
3.7 An example of a Noisy-MAX distribution
3.8 Transformation of a Noisy-MAX distribution
4.1 Details of Netherlands earthquake data in 1993
4.2 Details of flood data from KNMI, Katerveer, the Netherlands, 12/10/2000
4.3 Details of wind data in 0.1 m/s from USGS, New York, the USA, 08/01/2001 to 10/01/2001
4.4 Fitting Results of Four Distributions
4.5 Comparison of K-S Statistic Values, Original vs Limited
4.6 Parameter combinations of the Pareto distribution with different K-S statistic values
4.7 Probability of Earthquake with Different Magnitude Intervals
4.8 95% Confidence intervals of α
4.9 Lower bound and upper bound for the 1-in-X-years earthquake
5.1 Sampling Algorithm Efficiency
5.2 Inference Results: Evidence on Risk
5.3 Inference Results: Evidence on Business Process
5.4 Sensitivity between Hack and Processes
5.5 Sensitivity between Power outage and Processes
5.6 Sensitivity analysis, nodes of interest in the set of observed nodes
6.1 VaR at Different Confidence Levels
6.2 Probability that Loss Exceeds a Certain Threshold
B.1 Pareto and Generalized Pareto K-S statistics at 10 water stations
Chapter 1

Introduction

In the first chapter, we motivate and introduce the subject of this thesis. The introduction starts with some background information and the strategies of Business Continuity Management (BCM) that are applied. Thereafter, the relevant literature and research methods are explained briefly. Finally, the structure of the thesis is presented.

1.1 Business Continuity Management

BCM is "a management process that identifies potential threats to an organisation and the impacts to business operations those threats, if manifest, might cause, and which provides a framework for building organisational resilience with the capability for an effective response that safeguards the interests of its stakeholders, reputation, brand and value-creating activities" [1]. The process includes moving (recovering) operations to another location if a disaster occurs at a worksite or datacentre. The scope of planning we focus on should include recovery from different levels of disaster, which can range from short, localized disruptions, to days-long problems affecting several buildings, to the permanent loss of a building [2]. The risks BCM is concerned with form part of operational risk (see the risk taxonomy in Figure 1.1).

The analysis phase consists of a business impact analysis (BIA) and a business continuity risk analysis (BCRA). The BIA is an essential component of a bank's business continuity management; it focuses on understanding the criticality and vulnerability of the bank's processes and their dependencies [3]. Some processes are more crucial than others and require a greater allocation of funds, both for measures taken to prevent a disaster from occurring and for additional measures taken to mitigate its impacts. In this thesis, we will use a graph database to describe the properties of and relationships within the internal business structure, which enables us to execute queries under certain conditions.
In addition, social network indicators will be applied to evaluate criticality and vulnerability among different groups of components. The business continuity risk analysis (BCRA) investigates possible threats in detail, the probability or likelihood of those threats, and experiments with different scenarios [2]. Common threats include earthquakes, floods, hurricanes and other major storms, power outages, pandemics, fires, cyber-attacks and random failures of systems. These threats are observed on buildings or applications. To examine the links between risks and critical processes, we build a Bayesian network that reflects the business structure; the dependencies in those relationships are represented as conditional distributions within the network. A typical
Figure 1.1: Bank Risk Taxonomy

method to assess threats is to fit risk data to certain probability distributions and to set thresholds based on severity levels, so that probabilities at those critical points can be read off from the cumulative distributions. Another approach is to learn the parameters in the Bayesian network directly from the available data, which yields a posterior distribution if prior information is known with a particular confidence. Both procedures estimate the occurrence of incidents while taking some mitigation measures into account. To investigate the inaccuracy of the assessment, a sensitivity analysis is performed, which shows how variations in the parameters affect the output through quantitative relationships. Moreover, Bayesian inference lets us compute conditional probabilities of interest once threats are observed.

An important application of quantitative analysis in BCM is Value at Risk. Value at Risk (VaR) in BCM is defined as the maximum loss amount that an organization would not exceed over a given time horizon at a certain confidence level. It is a useful estimate that tells an organization how much capital should be reserved for potential risk every year. VaR in BCM usually considers the probability of an event happening in one year instead of its frequency distribution. A loss distribution is combined with a risk probability distribution so that VaR can be calculated from the aggregated loss distribution. Expected and unexpected losses can be identified, providing an estimate of provisions and capital requirements.

1.2 Literature and Research Methods

The ideas of the BIA are widely introduced in a range of literature, and many banks are working on it [4]. In practice, the BIA is an important process that probes into business processes to determine and list the critical processes that are vital to keep the business going. It is
Figure 1.2: Business Structure in a Bank

necessary to understand the business environment, gather data and information, identify the critical processes needed to carry out vital business operations, and finally prepare a BIA report listing the findings, to be submitted to top management. Potential impacts of an outage are identified and the corresponding recovery time objectives (RTOs) are determined. Furthermore, the financial impacts of an outage of processes for specific functions are considered. This gives us an overview of the flow of the BIA analysis.

Figure 1.3: Phases of BIA

The BCRA includes a risk assessment, in which the types of outages a bank is likely to suffer in a year are listed, together with the vulnerability to certain outages such as power failures, building fire, and so forth. In the risk assessment, we identify the most probable threats to the organization, as determined under Basel II [2], and analyze the related vulnerabilities to these threats. H. Chen and A. Pollino [5] expressed the basic ideas of a Bayesian analysis for BCRA. In this method, causes and effects in the BCRA are represented in a Bayesian network, where the business structure of an organization is reflected in interactions between variables by a graph with nodes and arcs. The strength of these relationships is defined in the conditional probability tables (CPTs)
Figure 1.4: Phases of BCRA

attached to each node. CPTs specify the degree of belief (probabilities) that a node will be in particular states given the states of its parent nodes (the nodes that directly affect it). The probabilities of risk nodes are obtained from risk data, which are usually fitted to probability distributions. In addition, the foundations of Bayesian networks and their probabilistic analysis are well described by D. Koller and N. Friedman [6], who also introduce the well-known inference algorithms. M.H. Coupe and L.C. van der Gaag gave an approach to sensitivity analysis within a Bayesian network [7]. Finally, loss distribution and VaR solutions for the BCRA can be found in the paper of E. Navarrete [8].

In this thesis we extend the probabilistic analysis of previous research in several ways. First of all, we introduce social network metrics such as centralities, which are good indicators of criticality and vulnerability, and apply them to the business world. Furthermore, statistical techniques such as Monte Carlo simulation and the bootstrap method are used frequently in this analysis. For heavy-tailed events, VaR remains a good measure of loss at a particular confidence level in BCM, which gives us an extension of its application.

1.3 Thesis Structure

The outline of the thesis is as follows. In Chapter 2, we build a business graph database which contains the business components and their dependencies. This makes it possible to query the database and find the influence of an outage of business processes within the business structure. Subsequently, the network structure is transferred to Gephi so that centralities can be calculated. The analysis is illustrated with an example. This network is based on the existing network structure of ABN AMRO.

In Chapter 3, we introduce graph theory and construct Bayesian networks based on these fundamental concepts.
We give detailed derivations of the properties of Bayesian networks through mathematical definitions and theorems. Furthermore, two types of conditional probability tables, tabular and Noisy-OR (Noisy-MAX), are introduced and compared. As an application, building-risk and application-risk models based on the bank's business structure are presented.
In Chapter 4, we fit the risk data to plausible distributions and estimate event probabilities for given thresholds. A goodness-of-fit technique helps us to select an appropriate distribution. In particular, the Pareto distribution is suitable for events where the likelihood of values is monotone non-increasing but converges slowly. Another method to estimate risk probabilities is Bayesian estimation: using a Dirichlet distribution as a prior, the probability distribution is updated with incoming data.

In Chapter 5, Bayesian inference algorithms are provided to calculate the conditional probability of every node given particular evidence. We discuss the differences among the algorithms in terms of calculation speed and accuracy. Thereafter, sensitivity analysis methods are defined and applied to the Bayesian network. This allows us to assess the inaccuracy of the parameter input of nodes and to find the interactions between the conditional probability of the nodes of interest and the probability under study.

In Chapter 6, we apply VaR to the BCM analysis, combining the event distribution and the loss distribution into an aggregated distribution. VaR is obtained from the percentiles at the confidence levels of interest. In particular, Monte Carlo simulations are again used to generate loss samples and to validate the accuracy of the mean and the percentiles.

In Chapter 7, we give a short conclusion and put forward some suggestions for future research not included in this thesis.
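As a preview of the Chapter 6 approach, the aggregation of an event probability with a loss-severity distribution can be sketched with a small Monte Carlo simulation. The event probability and Pareto parameters below are illustrative assumptions, not the values fitted later in the thesis:

```python
import random

# Monte Carlo sketch of an aggregated yearly loss distribution and its VaR.
# p_event, alpha and x_m are invented for illustration only.
random.seed(42)

def yearly_loss(p_event=0.1, alpha=2.5, x_m=1.0):
    """One simulated year: with probability p_event an incident occurs and
    draws a Pareto(alpha, x_m) loss amount; otherwise the loss is 0."""
    if random.random() < p_event:
        # Inverse-CDF sampling of a Pareto distribution.
        return x_m / random.random() ** (1.0 / alpha)
    return 0.0

losses = sorted(yearly_loss() for _ in range(100_000))
var_99 = losses[int(0.99 * len(losses))]  # 99% VaR = 99th percentile
print(f"Simulated 99% VaR of the yearly loss: {var_99:.2f}")
```

With these assumed parameters the theoretical 99th percentile solves 0.1(x_m/x)^α = 0.01, i.e. roughly 2.5, and the simulated value should land near it.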
Chapter 2

Business Impact Analysis

In this chapter, we focus on analyzing the interconnection of process elements in different groups, and their criticality and vulnerability described using social network characteristics. Two tools are used to implement the models: Neo4j and Gephi. Below we present the modeling procedures and some example results.

2.1 Business Graph Database

Relational database systems are generally efficient unless the data contains many relationships that require joins over large relationship tables. In general, a graph database does better at structural queries than a relational database. A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way. As in Figure 2.1, a graph starts with nodes (vertices) carrying labels, which are used to group the nodes and restrict queries to subsets of the graph. Relationships (arcs) organize nodes into arbitrary structures, with properties that can identify different types of dependencies [9].

Figure 2.1: Graph Elements and Properties
2.1.1 Business Structure in a Bank

In the existing business structure of a bank (Figure 2.2), nodes represent business processes, sub-processes, RTOs (recovery time objectives), applications, business lines and buildings in different locations; labels on nodes are their corresponding names and unique IDs; relationships are defined as 1-to-n mapping relations such as "located in", "contains" and "has", or m-to-n mapping relations such as "use".

Figure 2.2: Overview of Nodes and Relationships of the Business Structure

At the level of interest we have about 20 business processes, with approximately 200 sub-processes, 70 locations, 150 applications, 6 RTOs and 20 business lines involved in the business structure. All of these elements can be displayed in a graph network once we import the necessary information. To realize this, we use the data, collected in Excel, to create Cypher query statements and run them in Neo4j (see Figure 2.3). Neo4j is an open-source graph database implemented in Java [10]. It uses the Cypher query language, a declarative graph query language that allows for expressive and efficient querying and updating of the graph store, to create or retrieve information based on our interests [11]. Results are easy to obtain in an intuitive, visual interface.

2.1.2 An Application of Query in Neo4j

Let us consider an example. If we want to know "Which applications are used by business processes that have an RTO of 0-2 hrs and that are located in Building 1?", we can look at the structure in Figure 2.2 and see the interconnections of the nodes that we are interested
Figure 2.3: Business Process Nodes with Names and IDs

in. First, applications are used by processes, which are included in some business processes. Subsequently, those processes are used by certain business lines, some of which are located in Building 1. Finally, the target processes or business processes that have an RTO of 0-2 hrs meet our requirement. As a consequence, we immediately have the following statements

    MATCH (rto:RTO {name:'0-2hrs'})-[:BUSINESSPROCESS_HAS_RTO]-(bp:BusinessProcess),
          p1 = shortestPath((bp)-[*..3]-(bul:Building {name:'Building 1'})),
          p2 = shortestPath((bp)-[*..2]-(a:Application))
    RETURN p1, p2, rto LIMIT 10;

In Neo4j we put the statements into the command window and execute them. After a short while, the nodes and the paths that meet our restrictions are displayed in the output area (Figure 2.4). For further investigation, we can right-click the nodes to see their properties, and double-click to see more nodes connected with them. In the output network, we limit the result to 10 applications (in yellow) that fit our requirement. It is clear how the applications and the target building (in cyan) are interconnected. Observed damage to Building 1 can have impacts on some applications through the related business lines (in pink) and business processes (in red), which have an RTO of 0-2 hrs (in white). The results can also be exported as a table to Excel, which enables us to perform further analysis. If there are changes in the business structure, those changes can also be applied to the network by some standard operations. We can add or delete nodes and their corresponding labels, and modify relationships among them if necessary.
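The traversal behind this query can also be sketched outside the database. The following minimal Python sketch uses invented IDs and a simplified structure, not the bank's actual data, to illustrate the same filtering logic:

```python
# Hypothetical miniature of the business structure; all names and IDs
# are invented for illustration.
bp_rto = {"BP1": "0-2hrs", "BP2": "24hrs"}                          # has RTO
bp_processes = {"BP1": ["P10", "P11"], "BP2": ["P20"]}              # contains
process_apps = {"P10": ["APP1"], "P11": ["APP2"], "P20": ["APP3"]}  # use
process_bl = {"P10": ["BL1"], "P11": ["BL2"], "P20": ["BL1"]}       # used by
bl_building = {"BL1": ["Building 1"], "BL2": ["Building 2"]}        # located in

def critical_apps(building, rto):
    """Applications used by processes whose business process has the given
    RTO and whose business line is located in the given building."""
    apps = set()
    for bp, processes in bp_processes.items():
        if bp_rto.get(bp) != rto:
            continue  # wrong RTO, skip the whole business process
        for p in processes:
            lines = process_bl.get(p, [])
            if any(building in bl_building.get(bl, []) for bl in lines):
                apps.update(process_apps.get(p, []))
    return sorted(apps)

print(critical_apps("Building 1", "0-2hrs"))  # -> ['APP1']
```

In the toy data only P10 satisfies both conditions (its business process BP1 has RTO 0-2 hrs and its business line BL1 sits in Building 1), so only APP1 is returned.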
Figure 2.4: Overview of Output Result in Neo4j under the Restriction of 10 Applications

2.2 Business Network Analysis

2.2.1 Basic Network Concepts

Figure 2.5: Undirected and Directed Graph

Now we will look at more characteristics used in network analysis, for which we first need some basic mathematical concepts. The definitions of the network metrics are provided by I. Robinson, J. Webber and E. Eifrem [9]. In the mathematical literature, we refer to a network as a graph. A graph is an ordered pair G = (V, E) comprising a set V of nodes together with a set E of edges or lines, which are 2-element subsets of V (i.e., an edge is related to two nodes, and the relation is represented as a pair of the nodes with respect to the particular edge). The most common graphs are undirected graphs and directed graphs [12][13]. An undirected graph is a graph, i.e., a set of nodes connected together, where all the edges are bidirectional; it is sometimes called an undirected network. In contrast, a graph where
the edges point in a direction is called a directed graph. A mathematical explanation of directed graphs is given in Section 3.1.

Consider an undirected network with n nodes. Let us label the nodes with integer labels 1, ..., n and denote an edge between nodes i and j by (i, j); then the complete network can be specified by the number of nodes n and a list of all the edges. A good representation of a network is the adjacency matrix. The adjacency matrix A of a simple graph is the matrix with elements A_{ij} such that

\[ A_{ij} = \begin{cases} 1, & \text{if there is an edge between nodes } i \text{ and } j \\ 0, & \text{otherwise} \end{cases} \tag{2.1} \]

For example, the undirected network in Figure 2.5(a) can be represented as

\[ A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \tag{2.2} \]

The matrix is symmetric, since connections between nodes are reciprocal. Similarly, the adjacency matrix of a directed network has elements

\[ A_{ij} = \begin{cases} 1, & \text{if there is an edge from node } i \text{ to } j \\ 0, & \text{otherwise} \end{cases} \tag{2.3} \]

As an example, the adjacency matrix of the directed network in Figure 2.5(b) is

\[ A = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \tag{2.4} \]

Note that this matrix is not symmetric. In general the adjacency matrix of a directed network is asymmetric.

As Figure 2.5 shows, each of the nodes can reach another through a path like a circle. A cycle in a directed network is a closed loop of edges with the arrows on each of the edges pointing the same way around the loop. Some directed networks, however, have no cycles, and these are called acyclic networks; graphs with cycles are called cyclic. In Figure 2.5, if any of the edges is eliminated, the cyclic directed network becomes acyclic. A typical directed acyclic network is a Bayesian network, which can represent probabilistic relationships between random variables. We will see more formal explanations and applications of Bayesian networks in Chapter 3.

To measure the connection properties within a graph, we use the degree of a node, which is the number of edges connected to it.
We will denote the degree of node i by k_i. For an undirected graph of n nodes the degree can be written in terms of the adjacency matrix as

\[ k_i = \sum_{j=1}^{n} A_{ij} \tag{2.5} \]

Every edge in an undirected graph has two ends, and if there are m edges in total then there are 2m ends of edges. But the number of ends of edges is also equal to the sum of the degrees
of all the nodes, so

\[ m = \frac{1}{2}\sum_{i=1}^{n} k_i \tag{2.6} \]

The mean degree c of a node in an undirected graph is

\[ c = \frac{2m}{n} \tag{2.7} \]

Node degrees are more complicated in directed networks, where each node has two degrees. The in-degree is the number of ingoing edges connected to a node and the out-degree is the number of outgoing edges. Bearing in mind that the adjacency matrix of a directed network has element A_{ij} = 1 if there is an edge from i to j, the in- and out-degrees can be written as

\[ k_i^{\mathrm{in}} = \sum_{j=1}^{n} A_{ji}, \qquad k_j^{\mathrm{out}} = \sum_{i=1}^{n} A_{ji} \tag{2.8} \]

The number of edges m in a directed network is equal to the total number of incoming ends of edges at all nodes, or equivalently to the total number of outgoing ends of edges, so

\[ m = \sum_{i=1}^{n} k_i^{\mathrm{in}} = \sum_{j=1}^{n} k_j^{\mathrm{out}} \tag{2.9} \]

Thus the mean in-degree c^in and the mean out-degree c^out are equal in every directed network:

\[ c^{\mathrm{in}} = \frac{m}{n} = c^{\mathrm{out}} \tag{2.10} \]

Another important property that we introduce is the path. A path in a network is any sequence of nodes such that every consecutive pair of nodes in the sequence is connected by an edge in the network. In a directed network, each edge traversed by a path must be traversed in the correct direction for that edge; in an undirected network edges can be traversed in either direction. More mathematical background on paths can be found in Chapter 3. The length of a path in a network is the number of edges traversed along the path (not the number of nodes).

It is straightforward to calculate the number of paths of a given length r in a network. For either a directed or an undirected simple graph, the element A_{ij} is 1 if there is an edge from node i to node j, and 0 otherwise (for undirected graphs A_{ij} = A_{ji}). Then the product A_{ik}A_{kj} is 1 if there is a path of length 2 from i to j via k, and 0 otherwise, and the total number of paths of length two from i to j, via any other node, is

\[ N^{(2)}_{ij} = \sum_{k=1}^{n} A_{ik}A_{kj} = [A^{2}]_{ij} \tag{2.11} \]

where [...]_{ij} denotes the ij-th element of a matrix.
Generalizing from $r = 2$ to paths of arbitrary length $r$, we see that
$$N^{(r)}_{ij} = [A^r]_{ij} \qquad (2.12)$$
where $[A^r]_{ij} = \sum_{k_1,\dots,k_{r-1}} A_{ik_1}A_{k_1k_2}\cdots A_{k_{r-1}j}$.
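The degree and path-count formulas above are easy to verify numerically. A minimal sketch with NumPy on a small illustrative directed graph (the matrix below is our own example, not data from the thesis):

```python
import numpy as np

# Directed adjacency matrix: A[i, j] = 1 if there is an edge from i to j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])

k_in = A.sum(axis=0)   # Eq. (2.8): in-degree of i is sum_j A[j, i]
k_out = A.sum(axis=1)  # Eq. (2.8): out-degree of j is sum_i A[j, i]
m = A.sum()            # Eq. (2.9): total number of edges
n = A.shape[0]
print(m / n)           # Eq. (2.10): mean in-degree = mean out-degree

# Eq. (2.12): the number of paths of length r from i to j is [A^r]_ij.
A2 = np.linalg.matrix_power(A, 2)
print(A2[0, 3])        # paths of length 2 from node 0 to node 3
```

For this graph there is exactly one length-2 path from node 0 to node 3 (via node 2), and the matrix power recovers it without any explicit path enumeration.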
As a consequence, we obtain a method for calculating the number of paths between two nodes in a network.

2.2.2 Centrality

One of the most important metrics involved in network analysis is centrality [12]. The centrality of a node measures its relative importance within a graph. The concept was first developed in social network analysis, and many of the terms used to measure centrality reflect this sociological origin. Nonetheless, the methods described here are now widely used in areas outside the social sciences [12]. Applications of centrality include how influential a person is within a social network, how important a room is within a building, how critical a hub is in the Internet, and how well-used a road is within an urban network. Here we apply it to quantify our business structure. In this section three measures of centrality are introduced: degree, betweenness and eigenvector centrality [14]. These centralities are good measures of node importance; we will review their definitions and their application in business network analysis.

The simplest centrality measure in a network is just the degree of a node, the number of edges connected to it (see Section 2.2.1). Degree is sometimes called degree centrality in the social networks literature, to emphasize its use as a centrality measure. In directed networks, nodes have both an in-degree and an out-degree, and both may be useful as measures of centrality in the appropriate circumstances. Although degree centrality is a simple measure, it can be very illuminating. In the business structure of a bank, for instance, Process nodes contribute in-degree to Business Line nodes, while edges from Business Line to Building count toward out-degree. Business Lines with a higher in-degree (criticality index) are more critical to the other nodes, because incidents on them may have an impact on more Processes.
On the other hand, those with a lower out-degree (vulnerability index) are more vulnerable, since the failure of only one or a few buildings may cause an outage on them.

A different concept of centrality is betweenness centrality, which measures the extent to which a node lies on paths between other nodes [15]. Nodes with high betweenness centrality may have considerable influence within a network by virtue of their control over information passing between others. Specifically, the betweenness centrality of a node $i$ is defined as the number of paths that pass through $i$. Mathematically, let $n^i_{st}$ be 1 if node $i$ lies on the path from $s$ to $t$ and 0 if it does not (or if there is no such path). Then the betweenness centrality $x_i$ is given by
$$x_i = \sum_{st} n^i_{st} \qquad (2.13)$$
More generally, when several paths can connect $s$ and $t$, we let $n^i_{st}$ be the number of paths from $s$ to $t$ that pass through $i$ and define $g_{st}$ to be the total number of paths from $s$ to $t$. Then the betweenness centrality of node $i$ is
$$x_i = \sum_{st} \frac{n^i_{st}}{g_{st}} \qquad (2.14)$$
A node can have quite a low degree, be connected to other nodes that have a low degree, even
be a long way from others on average, and still have high betweenness. Figure 2.6 is an example: node A lies on a bridge between two groups within a network. Since any path between a node in one group and a node in the other must pass along this bridge, A acquires very high betweenness even though its degree centrality is only 2, and it may nonetheless have a lot of influence in the network as a result of its control over the flow of information between others. An extreme example such as A is called a single-failure point: the failure of A can cause an outage of its corresponding processes. Consider a normalized betweenness centrality, defined as the ordinary betweenness centrality (2.14) divided by $(n-1)(n-2)$, where $n$ is the total number of nodes in the network. The normalized betweenness centrality is then bounded by the interval $[0, 1]$, and a single-failure point always has normalized betweenness 1.

Figure 2.6: Node with High Betweenness Centrality and Low Degree Centrality

A natural extension of the simple degree centrality is eigenvector centrality [16]. In many circumstances a node's criticality in a network is increased by having connections to other nodes that are themselves important. Instead of awarding nodes just one point for each neighbor, eigenvector centrality gives each node a score proportional to the sum of the scores of its neighbors. Let us make some initial guess about the centrality $x_i$ of each node $i$; for instance, we could start by setting $x_i = 1$ for all $i$. We can use this setting to calculate a better estimate, which we define to be the sum of the centralities of $i$'s neighbors:
$$x_i = \sum_{j} A_{ji} x_j \qquad (2.15)$$
where $A_{ji}$ is an element of the adjacency matrix (for an undirected matrix, both ingoing and outgoing neighbours are considered). We can write this update in matrix notation as $x' = Ax$, where $x$ is the vector with elements $x_j$.
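The bridge example can be reproduced with a short brute-force evaluation of (2.14): enumerate all shortest paths between every pair of nodes and sum the fraction passing through a given node. A sketch in plain Python; the barbell-style graph below (two triangles joined through a bridge node A) is an illustrative stand-in for Figure 2.6, with our own node names:

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate every shortest path from s to t (BFS distances + backtracking)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if t not in dist:
        return []
    def back(v):
        if v == s:
            return [[s]]
        return [p + [v] for u in adj[v] if dist.get(u) == dist[v] - 1
                for p in back(u)]
    return back(t)

def betweenness(adj, i):
    """Eq. (2.14): sum over pairs s, t != i of (paths through i) / (all paths)."""
    score = 0.0
    for s, t in combinations([v for v in adj if v != i], 2):
        paths = all_shortest_paths(adj, s, t)
        if paths:
            score += sum(i in p for p in paths) / len(paths)
    return score

# Two triangles joined by a bridge node A, as in Figure 2.6.
edges = [("L1", "L2"), ("L1", "L3"), ("L2", "L3"),
         ("R1", "R2"), ("R1", "R3"), ("R2", "R3"),
         ("A", "L1"), ("A", "R1")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

# A has degree 2, yet all left-right traffic must cross it.
print(len(adj["A"]), betweenness(adj, "A"), betweenness(adj, "L1"))
```

Here A has a higher betweenness than L1, even though L1 has degree 3 and A only degree 2, matching the intuition in the text.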
Repeating this process to make better estimates, we have a vector of centralities $x(t)$ after $t$ steps given by
$$x(t) = A^t x(0) \qquad (2.16)$$
Now let us write $x(0)$ as a linear combination of the eigenvectors $v_i$ of the adjacency matrix,
$$x(0) = \sum_i c_i v_i \qquad (2.17)$$
for some appropriate choice of constants $c_i$. Then
$$x(t) = A^t \sum_i c_i v_i = \sum_i c_i \lambda_i^t v_i = \lambda_1^t \sum_i c_i \left(\frac{\lambda_i}{\lambda_1}\right)^t v_i \qquad (2.18)$$
where the $\lambda_i$ are the eigenvalues of $A$, and $\lambda_1$ is the largest of them. Note that the undirected matrix
$A$ is real symmetric, so all its eigenvalues are real [17]. Since $|\lambda_i/\lambda_1| < 1$ for all $i \neq 1$, all terms in the sum other than the first decay exponentially as $t$ becomes large, and hence in the limit $t \to \infty$ we get $x(t) \to c_1 \lambda_1^t v_1$, i.e., a vector proportional to the leading eigenvector. In other words, the limiting vector of centralities is simply proportional to the leading eigenvector of the adjacency matrix. Equivalently, we could say that the centrality $x$ satisfies
$$Ax = \lambda_1 x \qquad (2.19)$$
This, then, is the eigenvector centrality. As promised, the centrality $x_i$ of node $i$ is proportional to the sum of the centralities of $i$'s neighbors (Perron-Frobenius theorem [12]),
$$x_i = \lambda_1^{-1} \sum_j A_{ij} x_j \qquad (2.20)$$
which gives eigenvector centrality the nice property that it can be large either because a node has many neighbors or because it has important neighbors (or both). A Business Line in a business network, for instance, can be important by this measure because it is linked to many processes (even though those processes may be less important) or because it connects to processes that are critical. Likewise, an RTO that is related to more critical processes may have a higher eigenvector centrality, showing that the RTO itself is also critical.

2.2.3 An Application of Centrality in Gephi

We will describe the application of centrality to the business data model shown in Figure 2.2 using Gephi. Gephi is an open-source network analysis and visualization software package written in Java on the NetBeans platform [18]. It is convenient for building a visualized network with an elegant layout, and data structures from Neo4j can be imported into Gephi easily. After doing this, metrics of the network, its nodes and its edges can be computed from the graph. We will see the procedure and results next.

Figure 2.7: Data Importing from Neo4j
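The power-iteration argument of Section 2.2.2 translates directly into code. A sketch with NumPy on a small symmetric (undirected) adjacency matrix of our own devising; normalizing at each step keeps the iterate finite, and after 100 iterations (the same count used in Gephi) the result matches the leading eigenvector computed by `numpy.linalg.eigh`:

```python
import numpy as np

# Undirected (symmetric) adjacency matrix of a small example graph.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

x = np.ones(A.shape[0])      # initial guess x_i = 1, as in Eq. (2.15)
for _ in range(100):         # x(t) = A^t x(0), renormalized (Eq. 2.16)
    x = A @ x
    x /= np.linalg.norm(x)

# Leading eigenvector of A; eigh sorts eigenvalues in ascending order.
vals, vecs = np.linalg.eigh(A)
v1 = np.abs(vecs[:, -1])     # Perron-Frobenius: entries share one sign
print(np.allclose(x, v1, atol=1e-6))
```

Node 2 has both the highest degree and the highest eigenvector score here; on less regular graphs the two rankings can disagree, which is exactly the refinement the text describes.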
Figure 2.8: Grouped, Colored Network and Layout in Good Visualization

Figure 2.9: Nodes Properties and Metrics in Data Laboratory
In Figure 2.7 and Figure 2.8, we show our business data model before and after applying the Force Atlas 2 layout algorithm [19]. Different colors represent different groups of nodes in the network, and node sizes depend on degree centrality, ranging from size 2 to 10. In the data laboratory, more specific details are displayed (Figure 2.9). We can view, add, delete and modify all properties of nodes and edges in this window, and network metrics can also be shown there after they have been computed. Note that in the calculation of eigenvector centrality we set the number of iterations to 100; more iterations make the computed values more likely to converge to the true values, but the calculation takes longer. We then export the data into Excel and separate it into different groups for analysis.

Business Process nodes have only out-degree and no in-degree, and their betweenness and eigenvector centrality are both zero, so these metrics carry no information for them. We can therefore estimate the vulnerability of a Business Process purely from its out-degree centrality. The opposite holds for Application nodes, which have only in-degree and no out-degree. The number of Process nodes is large, and they have similar values of in- and out-degree. However, Process nodes are linked to Business Process, RTO, Application and Business Line nodes, so they are crucial bridges between the other groups of nodes. To evaluate their criticality, the best metric we can choose is betweenness centrality; larger values indicate a more vital role of a Process within the network. Next we turn to the Building nodes, which all have the same out-degree (an n-to-1 mapping to cities) but differing in-degrees (they contain different numbers of business lines).
Furthermore, their eigenvector centrality depends strongly on the importance of Business Lines, so buildings in which critical business lines are located may have larger eigenvector centrality values. Business Line nodes present the most complicated situation. They have different numbers of edges coming from Process nodes and may be located in various buildings. As we saw at the beginning of Section 2.2.2, a high in-degree implies more influence on Processes, while a low out-degree indicates greater vulnerability, since fewer buildings are linked to them. Moreover, betweenness centrality also captures how important a business line is as a bridge in the network, and eigenvector centrality is a good refinement of degree centrality, since node criticality can be measured more accurately. We should note that this analysis is only an approximate exploration of the internal dependencies and influences within the business structure. The data we use may not perfectly reflect the real structure, which can also change over time; we should therefore keep track of those changes and update the information in the model accordingly.
Figure 2.10: Out-degree of Business Process and in-degree of Application. BP24 has the smallest out-degree among Business Processes, with one process and one RTO. APP1045 and APP1033 have the largest in-degree among Applications; both are used by 21 processes

Figure 2.11: Betweenness centrality of Process. The most influential node under this measure is P352
Figure 2.12: In-degree and eigenvector centrality of Building. Both metrics indicate that BU10016 has more impact on business lines than the other buildings

Figure 2.13: Out-degree, betweenness and eigenvector centrality of Business Line. BL5011 has the lowest out-degree (it is located in only one building), while BL5017 and BL5016 have the largest betweenness and eigenvector centrality values
Chapter 3

BCRA: Model Description

We want to analyze BCRA based on a probabilistic graphical model, in particular a Bayesian network. In this chapter we therefore describe the mathematical background of Bayesian networks and conditional probability distributions, and provide an application to the business structure of a bank. This example will also be used in the following chapters. Most of the theory below is from [6].

3.1 Graphs

In Section 2.1.1 we defined some basic concepts of graphs. To lay the foundations of a Bayesian network, we will see some representations of a probability distribution that use a graph as a data structure. In this section we survey more concepts from graph theory in a mathematical way.

A graph is a data structure $\mathcal{K}$ consisting of a set of nodes and a set of edges. Throughout most of what follows, we use a set of discrete random variables $\mathcal{X} = \{X_1, \dots, X_n\}$ to represent the set of nodes, in which a pair of nodes $X_i$, $X_j$ can be connected by a directed edge $X_i \to X_j$. We are only interested in directed graphs here, since Bayesian networks are directed. Thus the set of edges $\mathcal{E}$ is a set of pairs, where each pair is one of $X_i \to X_j$ or $X_j \to X_i$, for $X_i, X_j \in \mathcal{X}$, $i < j$. A graph is called directed if all its edges are of the form $X_i \to X_j$ or $X_j \to X_i$; such a graph is denoted $\mathcal{G}$ (see also Section 2.2.3).

Definition 3.1.1. (Directed Graph) Given a graph $\mathcal{K} = (\mathcal{X}, \mathcal{E})$, its directed version is a graph $\mathcal{G} = (\mathcal{X}, \mathcal{E}')$, where $\mathcal{E}' = \{X_i \to X_j : X_i, X_j \in \mathcal{X}\}$. Whenever $(X_i \to X_j) \in \mathcal{E}'$, we say that $X_j$ is the child of $X_i$ in $\mathcal{K}$, and that $X_i$ is the parent of $X_j$ in $\mathcal{K}$. We use $Pa_X$ to denote the parents of $X$ and $Ch_X$ to denote its children.

Figure 3.1 shows an example of a directed graph $\mathcal{G}$, in which $A$ is the parent of $B$ and $C$, and $E$ is their shared child. In many cases we want to consider only the part of the graph that is associated with a particular subset of the nodes; hence we can focus on a sub-graph of a particular graph.
Definition 3.1.2. (Sub-graph) Let $\mathcal{K} = (\mathcal{X}, \mathcal{E})$, and let $\mathcal{X}' \subseteq \mathcal{X}$. We define the sub-graph $\mathcal{K}[\mathcal{X}']$ to be the graph $(\mathcal{X}', \mathcal{E}')$, where $\mathcal{E}'$ consists of all the edges $X_i \to X_j \in \mathcal{E}$ with $X_i, X_j \in \mathcal{X}'$. The sub-graph is complete if every two nodes in $\mathcal{X}'$ are connected by some edge.
Although the subset of nodes $\mathcal{X}'$ can be arbitrary, we are often interested in sets of nodes that preserve certain aspects of the graph structure. Using the basic notion of edges, we can define different types of longer-range connections in the graph. First we introduce a mathematical definition of path and trail; then the concepts of ancestor and descendant, and of cycle and loop, are defined.

Definition 3.1.3. (Path and Trail) We say that $X_1, \dots, X_k$ form a directed path in the graph $\mathcal{K} = (\mathcal{X}, \mathcal{E})$ if, for every $i = 1, \dots, k-1$, we have $X_i \to X_{i+1}$. Furthermore, $X_1, \dots, X_k$ form a trail in $\mathcal{K} = (\mathcal{X}, \mathcal{E})$ if, for every $i = 1, \dots, k-1$, we have $X_i \rightleftharpoons X_{i+1}$ (that is, $X_i \to X_{i+1}$ or $X_{i+1} \to X_i$). A graph is connected if for every pair $X_i$, $X_j$ there is a trail between $X_i$ and $X_j$.

In the graph of Figure 3.1, $A, B, D$ form a trail that is also a path. On the other hand, $B, E, C$ is a trail but not a path. We can now define long-range relationships in the graph.

Definition 3.1.4. (Ancestor and Descendant) We say that $X$ is an ancestor of $Y$ in $\mathcal{K} = (\mathcal{X}, \mathcal{E})$, and that $Y$ is a descendant of $X$, if there exists a directed path $X_1, \dots, X_k$ with $X_1 = X$ and $X_k = Y$. We use $Descendants_X$ to denote $X$'s descendants and $Ancestors_X$ to denote $X$'s ancestors. In Figure 3.1, $A$ is an ancestor of $D$, $E$ and $F$, and $D$, $E$ and $F$ are descendants of $A$.

Definition 3.1.5. (Cycle and Loop) A cycle in $\mathcal{K}$ is a directed path $X_1, \dots, X_k$ where $X_1 = X_k$. A graph is acyclic if it contains no cycles. A loop in $\mathcal{K}$ is a trail $X_1, \dots, X_k$ where $X_1 = X_k$.

One of the most important concepts in this thesis is the directed acyclic graph (DAG), as DAGs are the basic graphical representation that underlies Bayesian networks. The graph in Figure 3.1 is a DAG; adding an edge from $D$ to $E$ would create a loop through $B$, $D$ and $E$, and adding an edge from $E$ to $A$ would create a cycle through $A$, $B$ and $E$.
A final useful notion is that of an ordering of the nodes in a directed graph that is consistent with the directionality of its edges.

Definition 3.1.6. (Topological Ordering) Let $\mathcal{K} = (\mathcal{X}, \mathcal{E})$ be a graph. An ordering of the nodes $X_1, \dots, X_n$ is a topological ordering relative to $\mathcal{K}$ if, whenever we have $X_i \to X_j$ for $X_i, X_j \in \mathcal{X}$, then $i < j$.
Topological orderings are exploited by many Bayesian inference algorithms, such as Variable Elimination (VE). We will investigate Bayesian inference in detail in Chapter 5.

3.2 Bayesian Network

As pointed out in Chapter 2, one of the most typical applications of probabilistic graphical models is the Bayesian network (BN), which represents a set of random variables and their conditional dependencies via a DAG. The core properties of a Bayesian network are its conditional parameterization and its independencies. In this section we explore the representation of a Bayesian network and the properties of its structure.

3.2.1 Independencies

Before introducing the representation of a Bayesian network, we first recall Bayes's theorem [20], which is of central importance in the mathematical manipulation of conditional probabilities.

Definition 3.2.1. Let $H$ and $E$ be events with $P(E) \neq 0$. The conditional probability of $H$ given $E$, $P(H \mid E)$, is defined as
$$P(H \mid E) = \frac{P(E, H)}{P(E)}$$
where $P(E, H)$ is the joint probability of $H$ and $E$.

Theorem 3.2.2. (Bayes' Theorem)
(i) (Simple form) Assume that $H$ and $E$ are events with $P(E) \neq 0$. Then the probability of $H$ given $E$ is given by
$$P(H \mid E) = \frac{P(H)P(E \mid H)}{P(E)}$$
(ii) (Extended form) Assume that $H_1, \dots, H_n$ is a partition of the sample space, so the events are mutually exclusive and together exhaust the sample space, and that $E$ is an event with $P(E) \neq 0$. Then the probability of $H_i$ given $E$, for $i = 1, \dots, n$, is given by
$$P(H_i \mid E) = \frac{P(H_i)P(E \mid H_i)}{\sum_j P(H_j)P(E \mid H_j)}$$

In a Bayesian network the joint probability distribution representing the whole network can be written as $P(X_1, \dots, X_n)$, where $X_1, \dots, X_n$ are the nodes of the BN under a topological ordering. By repeated application of the definition of conditional probability, the joint probability factorizes into a product of $n$ conditional probabilities:
$$P(X_1, \dots, X_n) = P(X_1)P(X_2 \mid X_1)P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \dots, X_{n-1}) = P(X_1)\prod_{i=2}^{n} P(X_i \mid X_1, \dots, X_{i-1}) \qquad (3.1)$$
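The extended form of Bayes' theorem is easy to exercise numerically. A minimal sketch on an illustrative partition (the scenario and all numbers are made up for the example): suppose one of three mutually exclusive incident causes $H_i$ must explain an observed outage $E$:

```python
# Theorem 3.2.2(ii) on a made-up three-way partition of incident causes.
priors = [0.7, 0.2, 0.1]         # P(H_i), illustrative values
likelihoods = [0.05, 0.4, 0.9]   # P(E | H_i), illustrative values

# Denominator: P(E) = sum_j P(H_j) P(E | H_j)
evidence = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior P(H_i | E) for each cause in the partition.
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]
print([round(q, 3) for q in posteriors])
```

Note how the rarest cause ends up with the largest posterior once the evidence strongly favors it: the prior of 0.1 is outweighed by the likelihood of 0.9.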
If $X_1, \dots, X_n$ are independent random variables, then (3.1) reduces to
$$P(X_1, \dots, X_n) = P(X_1)P(X_2)\cdots P(X_n) = \prod_{i=1}^{n} P(X_i) \qquad (3.2)$$
Factorization of a distribution $P$ is a property indicating that independencies exist in $P$, and vice versa. In Bayesian networks we often consider conditional independencies between the nodes, which simplify the representation of the factorized joint probability distribution in (3.1). Before we look at the independencies in Bayesian networks, we first consider the flow of probabilistic influence within the network, with which we can find the independencies between ancestor and descendant nodes.

Let us consider an example of a simple Bayesian network. Suppose two buildings have equal probability of fire occurrence (Figure 3.2). The network in Figure 3.2 has the opposite edge orientation to that in Figure 2.2: the directions of the relationships in Figure 2.2 illustrate the dependencies between business components, while those in Figure 3.2 represent the flow of influence caused by threats. Business Process 1 is related to Building 1, and Business Process 2 is related to Building 1 and Building 2. Therefore the influence of a fire in Building 1 may be passed on to Business Process 1. Conversely, Business Process 2 can give us information about whether it fails, which can be passed via Building 1 back to Fire. Moreover, an outage of Business Process 2 indicates that an incident has occurred in Building 1 or Building 2.

Figure 3.2: Flow of Probabilistic Influence in Bayesian Network

Another question is whether the influence of Building 1 can be passed through Business Process 2 to Building 2. An observation of fire in Building 1 tells us that Business Process 2 may be affected, but it does not tell us whether there is a fire in Building 2 or not. Thus information cannot flow from Building 1 to Building 2 through Business Process 2.
In contrast, if we have an observation on the intermediate node, we reach the opposite conclusion. In Figure 3.2, given evidence on Building 1, information cannot flow from Fire to Business Process 1 or from Business Process 1 to Fire. Similarly, Business Process 1 cannot influence Business Process 2 via Building 1 if evidence on Building 1 is given. But an observation of Business Process 2 gives us the information that, if a fire happens in Building 1, a fire is less likely to have happened in Building 2. In this case, the trail is active only if evidence has been set.
When influence can flow from $X$ to $Y$ via $W$, we say that the trail $X \rightleftharpoons W \rightleftharpoons Y$ is active. The results of our analysis for active two-edge trails are summarized in Table 3.1:

Trail                             $W \notin Z$    $W \in Z$
$X \to W \to Y$                   Active          Inactive
$X \leftarrow W \leftarrow Y$     Active          Inactive
$X \leftarrow W \to Y$            Active          Inactive
$X \to W \leftarrow Y$            Inactive        Active

Table 3.1: Whether a trail is active depends on whether $W$ belongs to the evidence set $Z$

The structure $X \to W \leftarrow Y$ is also called a v-structure. In the general case of a longer trail $X_1 \rightleftharpoons \cdots \rightleftharpoons X_n$, it is easy to see that influence can flow from $X_1$ to $X_n$ if every two-edge trail $X_{i-1} \rightleftharpoons X_i \rightleftharpoons X_{i+1}$ along the trail is active. We summarize this intuition in the following definition.

Definition 3.2.3. (Active Trail) Let $\mathcal{G}$ be a BN structure and $X_1 \rightleftharpoons \cdots \rightleftharpoons X_n$ a trail in $\mathcal{G}$. Let $Z$ be a subset of observed variables. The trail $X_1 \rightleftharpoons \cdots \rightleftharpoons X_n$ is active given $Z$ if
(i) whenever we have a v-structure $X_{i-1} \to X_i \leftarrow X_{i+1}$, then $X_i$ or one of its descendants is in $Z$;
(ii) no other node along the trail is in $Z$.
Note that if $X_1$ or $X_n$ is in $Z$, the trail is not active.

Sometimes we may have more than one trail between two nodes; one node can then influence another if there is any active trail along which influence can flow. Combining this with the former definition, we obtain the notion of d-separation, which provides a notion of separation between nodes in a directed graph.

Definition 3.2.4. (D-separation) Let $X$, $Y$, $Z$ be three sets of nodes in $\mathcal{G}$. We say that $X$ and $Y$ are d-separated given $Z$, denoted $d\text{-}sep_{\mathcal{G}}(X; Y \mid Z)$, if there is no active trail between any node $X \in X$ and any node $Y \in Y$ given $Z$. We use $\mathcal{I}(\mathcal{G})$ to denote the set of independencies that correspond to d-separation:
$$\mathcal{I}(\mathcal{G}) = \{(X \perp Y \mid Z) : d\text{-}sep_{\mathcal{G}}(X; Y \mid Z)\}$$

Let us connect this result to the independencies of probability distributions in a Bayesian network. Let $P$ be a distribution over $\mathcal{X}$; we define $\mathcal{I}(P)$ to be the set of independence assertions of the form $(X \perp Y \mid Z)$ that hold in $P$.
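Definition 3.2.3 is mechanical enough to implement directly: classify each two-edge segment of a trail as a v-structure or not, then apply the two rules. A sketch in plain Python; the edge list encodes the network of Figure 3.2, with abbreviated node names of our own choosing:

```python
def descendants(children, x):
    """All descendants of x in a DAG, given a child-adjacency dict."""
    seen, stack = set(), [x]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def is_active_trail(edges, trail, evidence):
    """Definition 3.2.3: is the trail active given the evidence set Z?"""
    Z = set(evidence)
    E = set(edges)
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    if trail[0] in Z or trail[-1] in Z:       # endpoints observed: not active
        return False
    for a, w, b in zip(trail, trail[1:], trail[2:]):
        if (a, w) in E and (b, w) in E:       # v-structure a -> w <- b
            if w not in Z and not (descendants(children, w) & Z):
                return False                  # collider blocks unless observed
        elif w in Z:                          # chain or fork blocked by evidence
            return False
    return True

# Figure 3.2: Fire -> Buildings, Buildings -> Business Processes.
edges = [("Fire", "B1"), ("Fire", "B2"),
         ("B1", "BP1"), ("B1", "BP2"), ("B2", "BP2")]

print(is_active_trail(edges, ["B1", "Fire", "B2"], []))      # fork, unobserved
print(is_active_trail(edges, ["B1", "BP2", "B2"], []))       # collider, unobserved
print(is_active_trail(edges, ["B1", "BP2", "B2"], ["BP2"]))  # collider observed
print(is_active_trail(edges, ["Fire", "B1", "BP1"], ["B1"])) # chain blocked
```

The four calls reproduce the four rows of Table 3.1 on the fire example: the collider trail through Business Process 2 is inactive until BP2 is observed, while the fork through Fire behaves the other way around.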
D-separation in $\mathcal{G}$ implies that the probability distribution $P$ satisfies the conditional independence $(X \perp Y \mid Z)$; in particular, taking $Z$ to be the parents of $X$ and $Y$ the non-descendants of $X$ yields the local independencies. We capture this relationship between $\mathcal{G}$ and its probability distribution $P$ in another definition.

Definition 3.2.5. (I-map) If $P$ satisfies $\mathcal{I}(\mathcal{G})$, then we say that $\mathcal{G}$ is an I-map (independency map) of $P$.
3.2.2 Factorization

Let us return to the example of Figure 3.2. We write $F$ for the Fire node, $B_1$ and $B_2$ for Building 1 and Building 2, and $BP_1$ and $BP_2$ for Business Process 1 and Business Process 2. Consider the joint probability distribution of this BN:
$$P(F, B_1, B_2, BP_1, BP_2) = P(F)P(B_1 \mid F)P(B_2 \mid F, B_1)P(BP_1 \mid F, B_1, B_2)P(BP_2 \mid F, B_1, B_2, BP_1) \qquad (3.3)$$
By Definition 3.2.4, $(B_1 \perp B_2 \mid F)$, $(BP_1 \perp F \mid B_1)$, $(BP_1 \perp B_2 \mid B_1)$, $(BP_2 \perp F \mid B_1, B_2)$ and $(BP_2 \perp BP_1 \mid B_1, B_2)$ hold, and we can therefore simplify (3.3) to obtain another representation:
$$P(F, B_1, B_2, BP_1, BP_2) = P(F)P(B_1 \mid F)P(B_2 \mid F)P(BP_1 \mid B_1)P(BP_2 \mid B_1, B_2) \qquad (3.4)$$
Notice that the local conditional probability distribution (CPD) of each node depends only on its parent nodes. This result tells us that any entry in the joint probability distribution can be computed as a product of factors, one for each variable, where each factor represents the conditional probability of the variable given its parents in the network. This factorization applies to any distribution $P$ for which $\mathcal{G}$ is an I-map. We now state this fundamental result more formally.

Definition 3.2.6. (Chain Rule) Let $\mathcal{G} = (\mathcal{X}, \mathcal{E})$ be a BN graph; then it represents a joint probability $P$ via the chain rule
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_{X_i})$$
where $X_1, \dots, X_n$ are the nodes of the BN and $Pa_{X_i}$ are the parent nodes of $X_i$. The individual factors $P(X_i \mid Pa_{X_i})$ are the CPDs.

Using Definition 3.2.6, we can define factorization over a Bayesian network.

Definition 3.2.7. (Factorization) Let $\mathcal{G} = (\mathcal{X}, \mathcal{E})$ be a BN graph over $X_1, \dots, X_n$. A probability distribution $P$ factorizes over $\mathcal{G}$ if
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_{X_i})$$

Using factorization and the previous definition, we can formally define a Bayesian network as follows.

Definition 3.2.8. (Bayesian Network) A Bayesian network is a pair $\mathcal{B} = (\mathcal{G}, P)$ where $P$ is specified as a set of CPDs associated with the nodes of $\mathcal{G}$.
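Factorization (3.4) can be checked mechanically: multiplying one CPD entry per node yields every entry of the joint distribution, and those entries must sum to 1. A sketch for the network of Figure 3.2; only the factorization structure comes from (3.4), while the numeric values for Fire and the buildings are illustrative (the CPD of Business Process 2 reuses the numbers of Table 3.2):

```python
from itertools import product

# CPDs for Figure 3.2, states coded 1 = fail, 0 = normal. The numbers for
# Fire, Building 1, Building 2 and Business Process 1 are made up.
def P_F(f):        return 0.1 if f else 0.9
def P_B1(b, f):    return (0.3 if f else 0.01) if b else (0.7 if f else 0.99)
def P_B2(b, f):    return (0.2 if f else 0.05) if b else (0.8 if f else 0.95)
def P_BP1(p, b1):  return (0.9 if b1 else 0.0) if p else (0.1 if b1 else 1.0)
def P_BP2(p, b1, b2):
    fail = 1 - (1 - 0.2 * b1) * (1 - 0.2 * b2)   # matches Table 3.2
    return fail if p else 1 - fail

# One CPD factor per node, as in Eq. (3.4); the 2^5 joint entries sum to 1.
total = sum(P_F(f) * P_B1(b1, f) * P_B2(b2, f) * P_BP1(p1, b1) * P_BP2(p2, b1, b2)
            for f, b1, b2, p1, p2 in product((0, 1), repeat=5))
print(round(total, 10))
```

The same loop, restricted to a subset of the state combinations, computes any marginal of the network by brute force; Chapter 5's inference algorithms exist precisely to avoid this exponential enumeration.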
The distribution $P$ is often denoted $P_{\mathcal{B}}$. We can now prove that the phenomenon we observed in Figure 3.2 holds more generally.

Theorem 3.2.9. Let $\mathcal{G}$ be a BN structure over a set of random variables $\mathcal{X}$, and let $P$ be a joint probability distribution over the same space. If $\mathcal{G}$ is an I-map for $P$, then $P$ factorizes according to $\mathcal{G}$.
Proof. Assume, without loss of generality, that $X_1, \dots, X_n$ is a topological ordering of the variables in $\mathcal{X}$ relative to $\mathcal{G}$. As in our example, we first use the chain rule for probabilities:
$$P(X_1, \dots, X_n) = P(X_1)\prod_{i=2}^{n} P(X_i \mid X_1, \dots, X_{i-1})$$
Now consider one of the factors $P(X_i \mid X_1, \dots, X_{i-1})$. As $\mathcal{G}$ is an I-map for $P$, we have that $(X_i \perp NonDescendants_{X_i} \mid Pa_{X_i}) \in \mathcal{I}(P)$. By assumption, all of $X_i$'s parents are in the set $\{X_1, \dots, X_{i-1}\}$. Furthermore, none of $X_i$'s descendants can possibly be in this set. Hence
$$\{X_1, \dots, X_{i-1}\} = Pa_{X_i} \cup Z$$
where $Z \subseteq NonDescendants_{X_i}$. From the local independencies for $X_i$, it follows that $(X_i \perp Z \mid Pa_{X_i})$. Hence we have that
$$P(X_i \mid X_1, \dots, X_{i-1}) = P(X_i \mid Pa_{X_i})$$
Applying this transformation to all of the factors in the chain rule decomposition, the result follows.

Thus, the conditional independence assumptions implied by a BN structure $\mathcal{G}$ allow us to factorize a distribution $P$ for which $\mathcal{G}$ is an I-map into small CPDs. Note that the proof is constructive, providing a precise algorithm for constructing the factorization given the distribution $P$ and the graph $\mathcal{G}$.

Theorem 3.2.9 shows one direction of the fundamental connection between the conditional independencies encoded by the BN structure and the factorization of the distribution into local probability models: the conditional independencies imply the factorization. The converse also holds: factorization according to $\mathcal{G}$ implies the associated conditional independencies.

Theorem 3.2.10. Let $\mathcal{G}$ be a BN structure over a set of random variables $\mathcal{X}$, and let $P$ be a joint probability distribution over the same space. If $P$ factorizes according to $\mathcal{G}$, then $\mathcal{G}$ is an I-map for $P$.

Proof. Assume again that $X_1, \dots, X_n$ is a topological ordering of the variables in $\mathcal{X}$ relative to $\mathcal{G}$.
Since $P$ factorizes according to $\mathcal{G}$, by Definition 3.2.7 we have
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_{X_i})$$
where $Pa_{X_i}$ are the parent nodes of $X_i$. Consider an individual factor $P(X_i \mid Pa_{X_i})$. We want to prove that $P(X_i \mid Pa_{X_i}, Z) = P(X_i \mid Pa_{X_i})$ for $Z \subseteq NonDescendants_{X_i}$, which implies $(X_i \perp NonDescendants_{X_i} \mid Pa_{X_i})$, the condition required by Definition 3.2.4. As in Theorem 3.2.9, $\{X_1, \dots, X_{i-1}\} = Pa_{X_i} \cup Z$; by the definition of conditional probability we immediately obtain
$$P(X_i \mid Pa_{X_i}, Z) = \frac{P(X_i, Pa_{X_i}, Z)}{P(Pa_{X_i}, Z)} = \frac{P(X_1, \dots, X_i)}{P(X_1, \dots, X_{i-1})} = \frac{\prod_{k=1}^{i} P(X_k \mid Pa_{X_k})}{\prod_{k=1}^{i-1} P(X_k \mid Pa_{X_k})} = P(X_i \mid Pa_{X_i})$$
Applying this result to all of the factors in the chain rule decomposition, it follows by Definition 3.2.5 that $P$ satisfies $\mathcal{I}(\mathcal{G})$.
In summary, we have two equivalent views of BN graph structures:

I-map to factorization: a directed graph $\mathcal{G}$, annotated with a set of conditional probability distributions $P(X_i \mid Pa_{X_i})$, defines a distribution via the chain rule for Bayesian networks.

Factorization to I-map: if a distribution $P$ factorizes over a directed graph $\mathcal{G}$, then $P$ satisfies the independence assumptions encoded by $\mathcal{G}$, so $\mathcal{G}$ is an I-map of $P$.

These two complementary views are the foundations of a Bayesian network, and they simplify the connections among nodes in BN structures. The representations of a BN, local or global, give us an appropriate basis for reasoning in various respects.

3.3 Local Probabilistic Models

In this section we examine CPDs in more detail, describe a range of representations, and consider their implications in terms of additional regularities we can exploit. In order to choose suitable models, however, we have to match them against our BN structures.

3.3.1 Tabular CPDs

The most common representation of a CPD is the tabular one, where we encode $P(X \mid Pa_X)$ as a table that contains an entry for each joint assignment to $X$ and $Pa_X$. For this table to be a proper CPD, we require that all the values are nonnegative and that, for each assignment $pa_X$ of the parents,
$$\sum_{x \in Val(X)} P(x \mid pa_X) = 1 \qquad (3.5)$$
where $Val(X)$ is the set of states of $X$.

It is clear that this representation is as general as possible: we can represent every possible discrete CPD using such a table. As we will also see, table-CPDs can be used in a natural way in inference algorithms. These advantages often lead to the perception that table-CPDs, also known as conditional probability tables (CPTs), are an inherent part of the Bayesian network representation.

Building 1:        Fail              Normal
Building 2:        Fail    Normal    Fail    Normal
Fail               0.36    0.2       0.2     0
Normal             0.64    0.8       0.8     1

Table 3.2: An example of the CPT of Business Process 2 in Figure 3.2

We take Figure 3.2 again as an example.
Let us consider the CPT of the node Business Process 2, which has two parents, Building 1 and Building 2, each with two states. If we also give Business Process 2 two states, we obtain $2 \times 2 = 4$ parent configurations and $2 \times 2 \times 2 = 8$ numbers to fill into the CPT. Denote by $\overline{BP_2}$ the normal state of Business Process 2
and by $BP_2$ the fail state; the same convention applies to $B_1$ and $B_2$. We can then read from the table that $P(\overline{BP_2} \mid \overline{B_1}, \overline{B_2}) = 1$, $P(BP_2 \mid \overline{B_1}, \overline{B_2}) = 0$, $P(\overline{BP_2} \mid B_1, \overline{B_2}) = 0.8$, $P(BP_2 \mid B_1, B_2) = 0.36$, etc. Suppose $P(\overline{B_1}, \overline{B_2}) = 0.8$, $P(\overline{B_1}, B_2) = 0.08$, $P(B_1, \overline{B_2}) = 0.08$ and $P(B_1, B_2) = 0.04$. The marginal probability of each state of Business Process 2 can then be calculated from the table; for the normal state,
$$P(\overline{BP_2}) = P(\overline{BP_2} \mid \overline{B_1}, \overline{B_2})P(\overline{B_1}, \overline{B_2}) + P(\overline{BP_2} \mid \overline{B_1}, B_2)P(\overline{B_1}, B_2) + P(\overline{BP_2} \mid B_1, \overline{B_2})P(B_1, \overline{B_2}) + P(\overline{BP_2} \mid B_1, B_2)P(B_1, B_2) = 0.9536 \qquad (3.6)$$
Similarly, if the fail state of Building 1 is set as evidence, the updated marginal probability of the normal state of Business Process 2 becomes
$$P(\overline{BP_2} \mid B_1) = P(\overline{BP_2} \mid B_1, B_2)P(B_2) + P(\overline{BP_2} \mid B_1, \overline{B_2})P(\overline{B_2}) = 0.7808 \qquad (3.7)$$
which is lower than the value without evidence. When the CPT becomes very large and the Bayesian network is complicated, however, inference algorithms are better suited to such questions; we will discuss this in Chapter 5.

The tabular representation also has several significant disadvantages. First, it is clear that we cannot store the values of a continuous random variable in a CPT; such variables are usually discretized into several intervals. Even in the discrete setting we encounter difficulties. The number of parameters needed to describe a table-CPD is the number of joint assignments to $X$ and $Pa_X$, that is, $|Val(X)| \cdot |Val(Pa_X)|$. This number grows exponentially with the number of parents. For example, if a binary variable $X$ has 5 binary parents, we need to specify $2^5 = 32$ parameters; with 10 parents, we need $2^{10} = 1024$ parameters. Clearly, the tabular representation rapidly becomes large and unwieldy as the number of parents grows. Huge tables not only require a lot of patience to fill in all the conditional probabilities (perhaps thousands of values), but also challenge the capacity of our computer memory.
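The marginal computations (3.6) and (3.7) are easy to reproduce in code. A minimal sketch; as in the text, the evidence case (3.7) combines the CPT row for Building 1 = fail with the marginal of Building 2:

```python
# P(BP2 = normal | B1, B2) from Table 3.2, keyed by (B1 state, B2 state).
p_normal = {("fail", "fail"): 0.64, ("fail", "normal"): 0.8,
            ("normal", "fail"): 0.8, ("normal", "normal"): 1.0}

# Joint distribution of the two buildings, as given in the text.
p_joint = {("normal", "normal"): 0.8, ("normal", "fail"): 0.08,
           ("fail", "normal"): 0.08, ("fail", "fail"): 0.04}

# Eq. (3.6): marginal probability that Business Process 2 is normal.
p_marg = sum(p_normal[s] * p_joint[s] for s in p_joint)
print(round(p_marg, 4))      # 0.9536

# Eq. (3.7): evidence "Building 1 = fail", combined with the marginal
# P(B2 = fail) = 0.08 + 0.04 = 0.12.
p_b2_fail = p_joint[("normal", "fail")] + p_joint[("fail", "fail")]
p_ev = (p_normal[("fail", "fail")] * p_b2_fail
        + p_normal[("fail", "normal")] * (1 - p_b2_fail))
print(round(p_ev, 4))        # 0.7808
```

Both printed values agree with (3.6) and (3.7); the evidence on Building 1 lowers the probability that Business Process 2 stays normal from 0.9536 to 0.7808.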
When learning parameters from data and running inference algorithms, huge CPTs may require excessive running time and can even exhaust memory. Furthermore, there may be regularities among the parameters describing similar combinations: in Table 3.2, for instance, the probabilities of fail and normal for Business Process 2 are the same under the parent configurations $(B_1, \overline{B_2})$ and $(\overline{B_1}, B_2)$. We may therefore look for a better representation in order to reduce the number of parameters needed to specify a CPD.

3.3.2 Noisy-OR CPDs

A practical solution to the problems of Section 3.3.1 is to take advantage of independence of causal interaction (ICI), which provides gates that reduce the number of parameters required to specify a conditional probability distribution from exponential to linear in the number of parents. The two most widely applied ICI distributions are the binary Noisy-OR model and its generalization, the Noisy-MAX model [21]. Their advantage is that they use a small number of parameters to represent the entire CPT. This leads to a significant reduction of the effort of filling in probability values, improves the quality of distributions learned from data, and reduces the running time and complexity of algorithms for Bayesian networks. In this section we introduce the Noisy-OR model (with Noisy-MAX
as an extension to multiple-valued variables), to see how it simplifies CPDs and how it relates to the CPT. Moreover, as a Bayesian network may not be a perfect model of our real business structure, a leak state can be introduced, which accounts for causes that may not be included in our model [22]. The Noisy-OR model with leak state was first introduced by M. Henrion [22], and J. Díez [23] gives another representation of it. In the Noisy-OR model, multiple causes influence an effect independently, and their combination is specified as an OR gate. Figure 3.3 shows the model graphically.

Figure 3.3: Noisy-OR Model

As in Figure 3.3, we define Z as the effect variable, the X_i as cause variables, and the Y_i as hidden variables. Suppose X_i has two states, 1 (the target state) and 0 (the subordinate, non-target state); the same holds for Y_i and Z. We define the probability that cause X_i individually triggers Y_i by a noise parameter λ_i, for i = 1, ..., n, where

P(Y_i = 1 | X_i) = { 0, if X_i = 0;  λ_i, if X_i = 1 }    (3.8)

The leak parameter of Y_0 is λ_0, which is often set extremely small if we believe that our model is approximately complete. Note that Y_1, ..., Y_n are mutually independent given X_1, ..., X_n. We immediately obtain a conditional probability distribution of Z, where

P(Z = 0 | X_1, ..., X_n) = (1 − P(Y_0 = 1)) ∏_{i: X_i = 1} (1 − P(Y_i = 1 | X_i)) = (1 − λ_0) ∏_{i: X_i = 1} (1 − λ_i)    (3.9)

and

P(Z = 1 | X_1, ..., X_n) = 1 − P(Z = 0 | X_1, ..., X_n)    (3.10)

Formula (3.9) indicates that the conditional probability that Z does not fall in the target state, given X_1, ..., X_n, can be written as the product over the Y_i of the probabilities of not falling in the target state, as a function of the λ_i. It shows the decomposition of the CPD and explains why the model is called Noisy-OR. The interaction between causes and effect is an OR relationship in which each cause variable can influence the effect variable independently through its hidden variable.
Nonetheless, there is some noise that somewhat affects the value of the effect variable. We can summarize this model in a parameter table as below (n = 3):

Cause    X_1            X_2            X_3            Leak
State    1      0       1      0       1      0
Z = 1    λ_1    0       λ_2    0       λ_3    0       λ_0
Z = 0    1−λ_1  1       1−λ_2  1       1−λ_3  1       1−λ_0

Table 3.3: Parameter table of the Noisy-OR model when n = 3

In practice, we often hide the state columns for X_i = 0, since they are identical. Here we have 3 × (2 − 1) + 1 = 4 parameters and 4 × 2 = 8 values in this table. To see how this model specifies a CPD, we transform Table 3.3 into a CPT as below:

X_1    1                                                  0
X_2    1                    0                             1                    0
X_3    1          0         1          0                  1          0         1          0
Z = 1  1−φ(1,1,1) 1−φ(1,1,0) 1−φ(1,0,1) 1−φ(1,0,0)        1−φ(0,1,1) 1−φ(0,1,0) 1−φ(0,0,1) 1−φ(0,0,0)
Z = 0  φ(1,1,1)   φ(1,1,0)   φ(1,0,1)   φ(1,0,0)          φ(0,1,1)   φ(0,1,0)   φ(0,0,1)   φ(0,0,0)

Table 3.4: CPT of an effect variable with 3 parents, transformed from the Noisy-OR model

where φ(1,1,1) = (1−λ_0)(1−λ_1)(1−λ_2)(1−λ_3), φ(1,1,0) = (1−λ_0)(1−λ_1)(1−λ_2), φ(1,0,1) = (1−λ_0)(1−λ_1)(1−λ_3), φ(0,1,1) = (1−λ_0)(1−λ_2)(1−λ_3), φ(1,0,0) = (1−λ_0)(1−λ_1), φ(0,1,0) = (1−λ_0)(1−λ_2), φ(0,0,1) = (1−λ_0)(1−λ_3), φ(0,0,0) = 1−λ_0. Here we have 2^3 = 8 columns and 8 × 2 = 16 values in this table. When n becomes large, say 20, the number of parameters in an ordinary CPT grows to 2^20 = 1048576, whereas in Noisy-OR the number of parameters is only 20 × (2 − 1) + 1 = 21, a huge reduction compared with the CPT. In the example of Figure 3.2, we can use a Noisy-OR model to represent the CPD, as in Table 3.5:

Parent    Building 1        Building 2        Leak
State     Fail    Normal    Fail    Normal
Fail      0.2     0         0.2     0         0
Normal    0.8     1         0.8     1         1

Table 3.5: Noisy-OR model of the example in Figure 3.2

Here λ_0 = 0. It is easy to read from the table that P(BP2 = N | B1 = N, B2 = N) = (1−0)(1−0) = 1, P(BP2 = F | B1 = N, B2 = N) = 1 − (1−0)(1−0) = 0, P(BP2 = N | B1 = F, B2 = N) = (1−0.2)(1−0) = 0.8, P(BP2 = F | B1 = F, B2 = F) = 1 − (1−0.2)(1−0.2) = 0.36, so we obtain the same results as in Section 3.3.1.
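Equations (3.9)–(3.10) are straightforward to implement; the sketch below (an illustrative function, not thesis code) reproduces the Table 3.5 example with λ_1 = λ_2 = 0.2 and λ_0 = 0:

```python
def noisy_or(lambdas, leak, x):
    """P(Z = 1 | X) for a Noisy-OR node.

    lambdas -- noise parameters lambda_i, one per cause X_i
    leak    -- leak parameter lambda_0
    x       -- 0/1 assignment of the cause variables
    """
    p_z0 = 1.0 - leak
    for lam, xi in zip(lambdas, x):
        if xi == 1:           # the product runs over active causes only, eq. (3.9)
            p_z0 *= 1.0 - lam
    return 1.0 - p_z0         # eq. (3.10)

lambdas, leak = [0.2, 0.2], 0.0          # parameters of Table 3.5
print(noisy_or(lambdas, leak, (0, 0)))   # both buildings normal -> 0.0
print(noisy_or(lambdas, leak, (1, 0)))   # one building fails   -> ~0.2
print(noisy_or(lambdas, leak, (1, 1)))   # both fail: 1 - 0.8*0.8 -> ~0.36
```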
If we set λ_0 = 0.001, we obtain Table 3.6; the resulting conditional probability distribution is almost identical to Table 3.2 for most purposes. As an extension of Noisy-OR, the Noisy-MAX model deals with an effect variable that has multiple states. The resulting value of the effect variable is the maximum of the states produced by each of its cause variables. The probability distribution of an effect variable Z given its causes X_1, ..., X_n, together with a leak state, can be expressed as in Table 3.7. Using the same transformation as in Table 3.4, we obtain the CPT in Table 3.8.
Building 1    Fail                 Normal
Building 2    Fail      Normal     Fail      Normal
Fail          0.36064   0.2008     0.2008    0.001
Normal        0.63936   0.7992     0.7992    0.999

Table 3.6: CPT including the leak state in the Noisy-OR model

Cause    X_1           X_2           Leak
State    1      0      1      0
Z = 2    0.2    0      0.1    0      0
Z = 1    0.2    0      0.4    0      0.001
Z = 0    0.6    1      0.5    1      0.999

Table 3.7: An example of a Noisy-MAX distribution

where P(Z = 0 | X_1 = 1, X_2 = 1) = 0.6 × 0.5 × 0.999 = 0.2997, P(Z = 1 | X_1 = 1, X_2 = 1) = 0.2 × 0.5 × (0.999 + 0.001) + 0.4 × 0.6 × (0.999 + 0.001) + 0.2 × 0.4 × (0.999 + 0.001) + 0.6 × 0.5 × 0.001 = 0.4203, etc. (calculated as in Table 3.8). Thus, in the Noisy-MAX model, we can extend the number of states of the effect variable, as well as the number of cause variables, as far as necessary. In the Noisy-OR and Noisy-MAX models, the conditional probabilities of Z under different combinations of the X_i have internal interactions. In general, however, the effect node can depend on the cause nodes in arbitrary ways, and not every CPT can be represented by Noisy-MAX perfectly. In many cases a CPT exhibits no such interactions, and therefore cannot be transformed into a Noisy-OR or Noisy-MAX model directly. Adam Zagorecki and Marek Druzdzel [24] proposed an algorithm that can fit a Noisy-OR or Noisy-MAX distribution to an arbitrary CPT: it finds the set of Noisy-OR (Noisy-MAX) parameters producing a CPT that is closest to the original one. In our business structure model this algorithm has not been needed so far, but it provides good support should further research require it.

3.4 Application of a Bayesian Network in Business Structure

In the previous sections we described the foundations of Bayesian networks and local probabilistic models, using conditional probability distributions on individual nodes. This section shows an application to our business structure, as in Figure 2.2. We build two Bayesian networks in GeNIe 2.0, a graphical interface that can be used to create decision-theoretic models [25].
Notice that the models are not perfect, but they give us an estimate of risk impacts flowing through certain structures.

X_1      1                 0
X_2      1        0        1        0
Z = 2    0.2799   0.2      0.0999   0
Z = 1    0.4203   0.2006   0.4005   0.001
Z = 0    0.2997   0.5994   0.4995   0.999

Table 3.8: Transformation of the Noisy-MAX distribution
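The Noisy-MAX expansion can be sketched as follows (an assumed helper, not thesis code): each active cause and the always-active leak independently produce a state, Z is their maximum, so the cumulative distribution of Z is a product of the causes' cumulative distributions. With the parameters of Table 3.7 this essentially reproduces the first column of Table 3.8 (the printed Z = 2 entries of the table differ in the fourth decimal):

```python
# P(Z_i = z | X_i = 1) for the causes and the leak,
# states ordered z = 0, 1, 2 (parameters of Table 3.7)
params = {
    'X1':   [0.6, 0.2, 0.2],
    'X2':   [0.5, 0.4, 0.1],
    'leak': [0.999, 0.001, 0.0],
}

def noisy_max(active):
    """Distribution of Z = max over the leak and the active causes."""
    cum = [1.0, 1.0, 1.0]             # P(Z <= z) as a product of cumulatives
    for name in list(active) + ['leak']:
        running = 0.0
        for z in range(3):
            running += params[name][z]
            cum[z] *= running
    # convert the cumulative distribution back to a point distribution
    return [cum[0], cum[1] - cum[0], cum[2] - cum[1]]

dist = noisy_max(['X1', 'X2'])
print([round(p, 4) for p in dist])    # [0.2997, 0.4203, 0.28]
```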
Under Basel II, a list of categories of operational risk has been officially defined. The risks that BCM is interested in are included in that list, and we import some typical ones into our models. The first model concerns building risks, including earthquake, flood, heavy storm, fire, power outage, terrorism and pandemic (we categorize them all as building risks to simplify the model). These risks, as well as some mitigation methods, are related to buildings, which are located in the Netherlands and in several other countries all over the world. Different business processes are linked to those buildings through Business Process → Process → Business Line → Building paths. Notice that natural disasters such as earthquake, flood and heavy storm are regional risks that differ across continents. Individual building risks, for example fire and power outage, have different probabilities for different buildings. Since we can hardly acquire and distinguish all the information about the buildings of interest (such as floor areas or energy suppliers), we set all those risks to equivalent impacts. Pandemic risk can occur randomly anywhere and in forms that are difficult to predict; hence we set its probability equal in every continent and let some probability spread to other continents. This is the global structure of the first model (Figure 3.4): red nodes are risk nodes, orange nodes are mitigation method nodes, blue nodes are building nodes, and yellow nodes are business process nodes.

Figure 3.4: Building Risk Model

For each risk node, we set three state levels that reflect different severities of impact, and for building and business process nodes two states that determine whether they fail or not. We set risk nodes, mitigation method nodes and building nodes as chance nodes, while the node type of the business processes is Noisy-OR.
The reason why the Noisy-OR model is used on business process nodes is that some business processes have many buildings as parent nodes, which gives rise to a
huge CPT. Using Noisy-OR, we avoid errors when the CPT is opened or the inference algorithm runs. Notice that a warning is displayed in GeNIe if we add more than 20 building nodes to a single business process. To handle this, we can split a target business process node into two or three sub-nodes and merge them together; each of the separated nodes then connects to fewer buildings, staying under the limit. Figure 3.5 illustrates the CPDs of risk nodes, building nodes and business process nodes of different types: the former are chance nodes, and the latter is Noisy-OR. The conditional probabilities filled into the building nodes are reasonable guesses, owing to a lack of knowledge of the probabilistic dependencies among disasters and mitigation methods. The same holds for the business process nodes; we therefore set all the conditional probabilities of failure due to damage in an individual building to 0.2, with the normal state at 0.8. In the real world, these probabilities depend on the criticality of the buildings; they are initial inputs and should be determined by experts in the future. Although there are surely differences among buildings, we treat them as equal in this model. The probabilities of the risks are estimated from data, which will be introduced in the next chapter.

Figure 3.5: CPD of Risk Nodes, Building Nodes and Business Process Nodes of Different Types

Let us introduce the second model, which describes risks on applications that are connected to a large number of processes. Two types of risks are defined, power outage and hack, along with some mitigation methods. Similarly, we set the node type of the processes to Noisy-OR and the others to chance. Compared with the business process nodes in the first model, the conditional probabilities of failure in the process nodes are all set to 0.8, with the normal states set to 0.2.
An explanation by experts is that a failure of one application is more likely to influence the operation of processes than damage to one building. All the leak parameters in the Noisy-OR nodes are set to 0.0001, but could be set higher if more important risk categories become known. The two models summarize how BCM risks influence business processes or processes within a bank. They are the basis of the further analysis in the following chapters. We will see how parameter learning works in these models and which inference algorithms can be implemented in
Figure 3.6: Application Risk Model
GeNIe in Chapter 4 and Chapter 5.
Chapter 4

BCRA: Risk Probability Analysis

In the previous chapter we saw how models are built based on the business structure within a bank. In this chapter we will see how to estimate the risk probabilities from data, using two approaches: one is to fit the data to certain probability distributions and obtain the probability of each state via predetermined thresholds; the other is to learn from the data directly in the Bayesian network, generating a posterior distribution under prior knowledge. Most of the data analysis in this chapter is implemented in Excel and Matlab.

4.1 Risk Data

The model described in Chapter 3 includes natural disaster risks, namely earthquake, flood and heavy storm, data on which are available on government or research institution websites. Fire and pandemic risk data are incomplete, while data on the other risks of interest are not available so far. In this section we look at the existing risk data that can be used to estimate the threat probabilities. For natural disasters, we have obtained data relating to the Netherlands and to the US.

4.1.1 Natural Disaster

Models that measure the impact of natural disasters are usually complicated. In the context of BCM, we are only interested in a rough estimate of the risks, without using geological knowledge. To evaluate the impact of certain disasters more easily, we can use basic metrics that quantify their level. For example, earthquakes are mostly measured on the Richter magnitude scale, on which higher levels indicate more severe impacts. Heavy storm measurement can focus on the wind speed (m/s), with a threshold above which high-speed wind causes damage. Flood, however, has two common measures: water level (m) and flow strength (cubic feet per second, cfs). Since we are not experts in geological science, it is convenient for us to use data on these metrics, which are more easily acquired and understood.
Such data lend themselves to being fitted to probability distributions or to being learned directly in a Bayesian network. The data used in our model are derived from KNMI (Koninklijk Nederlands Meteorologisch Instituut; Netherlands data) [26] and USGS (U.S. Geological Survey; US data) [27]. The time span of the data varies over the risk categories, limited by the website databases.
Figure 4.1: Netherlands earthquake magnitude data from KNMI, 1990 to 2014

YYMMDD     TIME       LOCATION         LAT     LON    INT   MAG   DEPTH
19930212   114600.76  Noordbroek       53.295  6.868  1     1     3
19930215   104658.55  Linne/Roermond   51.148  5.965  3     2.9   8.4
19930305   222725.22  Langelo          53.085  6.465  2.5   1.5   3
19930306   194353.61  Roermond         51.151  5.965  1     1.6   5
19930312   221241.52  Hoogezand        53.16   6.805  1     0.9   3
19930326   183421.16  Overschild       53.285  6.795  1     1.1   3
19930505   200832.79  Haren            53.177  6.685  1     1.5   3
19930514   193942     Ten Post         53.305  6.793  1     1.1   3
19930603   125739.09  Linne/Roermond   51.161  5.957  3     3.4   10.6
19930611   170435.6   Stramproy        51.207  5.755  1     2.5   17.3
19930627   20851.8    Bedum            53.317  6.65   1     1.4   3
19930627   25710.06   Stedum           53.315  6.66   1     1     3
19930710   2234.51    Appingedam       53.333  6.837  1     1.4   3
19930727   133918.07  Loppersum        53.336  6.808  1     0.8   3
19930823   5121.69    Nijenklooster    53.332  6.848  1     0.7   3
19930904   22450.15   Oldenzijl        53.363  6.765  1     1.4   3
19930922   173703.82  Middelstum       53.368  6.675  2.5   2     3
19930925   2133.46    Slochteren       53.208  6.812  1     0.9   3
19930930   231556.23  Linne/Roermond   51.149  5.938        2.1   10.3
19931123   123147.68  Slochteren       53.202  6.82   2.5   2.2   3
19931213   85953.96   Noordzee         55.181  4.563        3.3   27.7
19931222   20442.79   Ten Post         53.294  6.753  1     1.6   3
19931225   55622.84   Posterholt       51.136  6.017        1.6   14

Table 4.1: Details of Netherlands earthquake data in 1993
Locatie    Datum       Tijd   Waarde  Eenheid    Locatie    Datum       Tijd   Waarde  Eenheid
Katerveer  2000/10/12  0:00   39      cm         Katerveer  2000/10/12  12:00  40      cm
Katerveer  2000/10/12  1:00   40      cm         Katerveer  2000/10/12  13:00  41      cm
Katerveer  2000/10/12  2:00   39      cm         Katerveer  2000/10/12  14:00  42      cm
Katerveer  2000/10/12  3:00   39      cm         Katerveer  2000/10/12  15:00  42      cm
Katerveer  2000/10/12  4:00   40      cm         Katerveer  2000/10/12  16:00  43      cm
Katerveer  2000/10/12  5:00   40      cm         Katerveer  2000/10/12  17:00  43      cm
Katerveer  2000/10/12  6:00   40      cm         Katerveer  2000/10/12  18:00  43      cm
Katerveer  2000/10/12  7:00   40      cm         Katerveer  2000/10/12  19:00  42      cm
Katerveer  2000/10/12  8:00   39      cm         Katerveer  2000/10/12  20:00  43      cm
Katerveer  2000/10/12  9:00   39      cm         Katerveer  2000/10/12  21:00  44      cm
Katerveer  2000/10/12  10:00  41      cm         Katerveer  2000/10/12  22:00  45      cm
Katerveer  2000/10/12  11:00  40      cm         Katerveer  2000/10/12  23:00  44      cm

Table 4.2: Details of flood data from KNMI, Katerveer, the Netherlands, 12/10/2000

Specifically, the earthquake data span about 100 years but contain few records in the early years; the wind data are chosen from 1980 to 2014 for the Netherlands, but for the US they start in 1995; the flood data have a fixed range of 10 years (2000 to 2009), ignoring the annual increase in sea level, which is an insignificant disturbance relative to the water data. Table 4.1 shows part of the dataset of Netherlands earthquakes in 1993, with the typical information in each record; we only analyze the magnitudes in the penultimate column. Table 4.2 illustrates the details of the water level data, taking Katerveer as an example, with the water height for every hour. Notice that for the flood data we choose 10 water stations located in different districts of the Netherlands. We then perform a statistical analysis on them separately and summarize the results to obtain the final probability (this will be done in the next section).
STATION            STATION NAME                                        DATE      AWND  WSF2
GHCND:USW00094789  NEW YORK J F KENNEDY INTERNATIONAL AIRPORT NY US    20010108  35    67
GHCND:USW00054787  FARMINGDALE REPUBLIC AIRPORT NY US                  20010108  25    58
GHCND:USW00014734  NEWARK INTERNATIONAL AIRPORT NJ US                  20010108  38    76
GHCND:USW00054743  CALDWELL ESSEX CO AIRPORT NJ US                     20010108  8     31
GHCND:USW00094728  NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US     20010108  28    67
GHCND:USW00014732  NEW YORK LAGUARDIA AIRPORT NY US                    20010108  38    72
GHCND:USW00094741  TETERBORO AIRPORT NJ US                             20010108  25    45
GHCND:USW00094789  NEW YORK J F KENNEDY INTERNATIONAL AIRPORT NY US    20010109  76    130
GHCND:USW00054787  FARMINGDALE REPUBLIC AIRPORT NY US                  20010109  46    94
GHCND:USW00014734  NEWARK INTERNATIONAL AIRPORT NJ US                  20010109  67    107
GHCND:USW00054743  CALDWELL ESSEX CO AIRPORT NJ US                     20010109  39    89
GHCND:USW00094728  NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US     20010109  42    76
GHCND:USW00014732  NEW YORK LAGUARDIA AIRPORT NY US                    20010109  70    116
GHCND:USW00094741  TETERBORO AIRPORT NJ US                             20010109  53    94
GHCND:USW00094789  NEW YORK J F KENNEDY INTERNATIONAL AIRPORT NY US    20010110  76    134
GHCND:USW00054787  FARMINGDALE REPUBLIC AIRPORT NY US                  20010110  51    112
GHCND:USW00014734  NEWARK INTERNATIONAL AIRPORT NJ US                  20010110  60    125
GHCND:USW00054743  CALDWELL ESSEX CO AIRPORT NJ US                     20010110  32    89
GHCND:USW00094728  NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US     20010110  39    94
GHCND:USW00014732  NEW YORK LAGUARDIA AIRPORT NY US                    20010110  68    125
GHCND:USW00094741  TETERBORO AIRPORT NJ US                             20010110  44    98

Table 4.3: Details of wind data in 0.1 m/s from USGS, New York, the USA, 08/01/2001 to 10/01/2001

Similarly, wind speed data are gathered from 36 different stations, collected automatically by KNMI. Moreover, we prefer the two-minute maximum to the daily mean, because a high wind speed over a short time is a better indicator of a heavy storm within a day. For the US, we are only interested in the data of New York.
Table 4.3 shows the wind speed data of the New York wind stations, both the daily average and the two-minute maximum, for three days in January 2001. Apart from the Netherlands and US data, India (Mumbai) and South Africa (Johannesburg)
have only historical earthquake data, which can be found at USGS, while data for Eastern Asia (Hong Kong and Singapore) have not been collected. Now we turn to defining the thresholds at which disasters can cause damage to buildings. Generally, on the Richter magnitude scale, earthquakes of magnitude 5 or larger cause considerable damage, while those of at least 6 have severe impacts. Wind speed also has a standard, the Beaufort scale [28], an empirical measure that relates wind speed to observed conditions at sea or on land. We set 20 m/s as the limited-impact threshold and 24 m/s as the severe-impact threshold. The situation for flood is more complicated: water levels and flow strengths vary across rivers and seas [29]. Hence we need water references that support our settings, which can also be obtained from KNMI. Figure 4.2 shows the reference page of Katerveer; the corresponding two thresholds we set are 345 cm (once per 10 years) and 400 cm (once per 100 years) [30].

Figure 4.2: Water level and flow strength reference example of Katerveer, the Netherlands

4.1.2 Other Risks

In our Bayesian network, we have only three datasets sufficient for probability estimation. For other risks, such as building fire, we have data reports from Statistics Netherlands [31], but lack the information about office building floor areas that is indispensable
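The thresholds above amount to a simple mapping from a raw metric value to a severity state. A small sketch (the numeric thresholds are the ones quoted in this section; the function name and state labels are illustrative):

```python
THRESHOLDS = {
    # metric: (limited-impact threshold, severe-impact threshold)
    'earthquake_magnitude': (5.0, 6.0),      # Richter scale
    'wind_speed_ms':        (20.0, 24.0),    # m/s, Beaufort-based
    'water_level_cm':       (345.0, 400.0),  # Katerveer reference levels
}

def severity(metric, value):
    """Map a raw metric value to one of the three impact states."""
    limited, severe = THRESHOLDS[metric]
    if value >= severe:
        return 'severe'
    if value >= limited:
        return 'limited'
    return 'none'

print(severity('earthquake_magnitude', 5.4))  # limited
print(severity('wind_speed_ms', 25.0))        # severe
print(severity('water_level_cm', 300.0))      # none
```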
for analysis. Pandemic data, moreover, are too rough, so we can only estimate the occurrence probability globally. For risks for which we have no data source, we use reasonable guesses for their probabilities. Expert opinions, or any related information, can supplement our datasets.

4.2 Distribution Fitting

A common way to predict the probability that the impact of a risk occurs in a certain interval is distribution fitting. We choose one year as the time interval. Many probability distributions could be fitted to the data we use; we therefore need to select the distribution that fits the data best.

4.2.1 Goodness of Fit

To judge which distribution should be selected, we want to test whether a set of data fits a certain distribution. In this section we introduce a famous test: the non-parametric Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test (K-S test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K-S test), or to compare two samples (two-sample K-S test) [32]. To apply the K-S test we first introduce the Kolmogorov-Smirnov statistic (K-S statistic):

Definition 4.2.1. (Kolmogorov-Smirnov statistic) For given data X = {X_i : i = 1, ..., n}, the Kolmogorov-Smirnov statistic for a given cumulative distribution function (CDF) F(x) is defined as

D_n = sup_x |F_n(x) − F(x)|

where sup_x is the supremum over x, and F_n(x) is the empirical cumulative distribution function (ECDF), defined as

F_n(x) = (1/n) Σ_{i=1}^n I_{X_i ≤ x}

where I_{X_i ≤ x} is the indicator function, equal to 1 if X_i ≤ x and to 0 otherwise. By the Glivenko-Cantelli theorem, if the sample comes from the distribution F(x), then D_n converges to 0 almost surely as n goes to infinity.
The Kolmogorov-Smirnov statistic quantifies the distance between the ECDF of the data and the CDF of a certain distribution. With proper modifications it can also measure the distance between the ECDFs of two samples. A smaller D_n indicates a better distribution fit for the data. Before introducing the Kolmogorov-Smirnov test, we first give the definition of the Brownian bridge.
Definition 4.2.2. (Brownian bridge) A Brownian bridge is a continuous-time stochastic process B(t) whose probability distribution is the conditional probability distribution of a Wiener process W(t) given the condition that W(1) = 0, i.e.

B(t) := (W(t) | W(1) = 0), t ∈ [0, 1]

Proposition 4.2.3. If W(t) is a standard Wiener process, then

B(t) = W(t) − tW(1)

is a Brownian bridge for t ∈ [0, 1].

Proof. See [33].

Now we give the definition of the Kolmogorov distribution.

Definition 4.2.4. (Kolmogorov distribution) The Kolmogorov distribution is the distribution of the random variable

K = sup_{t ∈ [0,1]} |B(t)|

where B(t) is the Brownian bridge. The CDF of K is given by

P(K ≤ x) = 1 − 2 Σ_{k=1}^∞ (−1)^{k−1} e^{−2k²x²} = (√(2π)/x) Σ_{k=1}^∞ e^{−(2k−1)²π²/(8x²)}

Given these definitions we can proceed to the one-sample K-S test. The null hypothesis is that the samples x_i are drawn from the distribution F(x), under which we have the following theorem.

Theorem 4.2.5. (Kolmogorov theorem) If F(x) is continuous, then under the null hypothesis above, √n · D_n converges in distribution to the Kolmogorov distribution, where D_n is the K-S statistic.

Proof. See [34].

The goodness-of-fit test, or Kolmogorov-Smirnov test, is constructed by using the critical values of the Kolmogorov distribution. To implement the one-sample K-S test, we apply the following steps:

- Obtain the data X = {x_i : i = 1, ..., n} and compute the ECDF F_n(x)
- Choose a target distribution F(x) to test against
- Choose a significance level α
- Compute the K-S statistic D_n
- Reject the null hypothesis at level α if √n · D_n > K_α, where K_α is found from P(K ≤ K_α) = 1 − α

In some cases we cannot accept the null hypothesis even for the target distribution that minimizes D_n. We can then simply compute the K-S statistic for all distributions of interest, choose the minimum, and ignore the test results.
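The steps above can be sketched with SciPy's one-sample K-S test (illustrative synthetic data standing in for the thesis datasets):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.2, sigma=0.6, size=500)

# D_n against the true generating distribution: a small statistic
d_n, p_value = stats.kstest(x, stats.lognorm(s=0.6, scale=np.exp(0.2)).cdf)

# D_n against a clearly mismatched target: a larger statistic
d_bad, p_bad = stats.kstest(x, stats.norm(loc=1.0, scale=0.5).cdf)

print(d_n, d_bad)  # the better-fitting target gives the smaller D_n
```

In the thesis's setting one would compare `d_n` across the fitted candidate distributions and keep the smallest, as described at the end of this section.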
4.2.2 Pareto Distribution

Many types of statistical data are symmetrically distributed around the mean, with the frequency of values diminishing farther from the mean. A typical example is the normal distribution, for which mean and median are equal. For our natural disaster data this is not the case: for instance, smaller earthquake magnitudes occur more frequently than larger magnitudes. This implies that the larger values tend to be farther from the mean than the smaller values; in other words, the distribution is skewed to the right. Typical such distributions are the lognormal distribution, the log-logistic distribution, the gamma distribution and the Pareto distribution, the last of which is bounded to the left. Sometimes we do not have sufficient data reaching the threshold at which a disaster causes considerable damage, and lacking records of destructive disasters we are likely to underestimate the probability of such a risk. The cumulative probabilities of right-skewed distributions often converge to 1 more slowly than those of symmetric ones, which evaluates the probability of extreme values more accurately. Like other right-skewed distributions, the Pareto distribution gives a good estimate of extreme event probabilities because of its heavy tail [35]. Applications of the Pareto distribution and its extensions to earthquake magnitudes, water levels and wind speeds have been widely conducted. In this section we introduce the background of the Pareto distribution and of its extension, the generalized Pareto distribution. The Pareto distribution is a power law probability distribution with the attribute of scale invariance. Its definition is given as follows [36].

Definition 4.2.6. (Pareto distribution) If X is a random variable with a Pareto distribution, then the probability that X is greater than some number x, i.e.
the survival function, is given by

F̄_X(x) = P(X > x) = { (x_m / x)^α,  x ≥ x_m;  1,  x < x_m }

where x_m (x_m > 0) is the minimum possible value of X, and α is a positive parameter. The Pareto distribution is characterized by the scale parameter x_m and the shape parameter α, which is known as the tail index. The cumulative distribution function is given by

F_X(x) = { 1 − (x_m / x)^α,  x ≥ x_m;  0,  x < x_m }    (4.1)

and the density function is

f_X(x) = { (α / x_m)(x / x_m)^{−(α+1)},  x ≥ x_m;  0,  x < x_m }    (4.2)

The Pareto distribution is scale free in the sense that

f(cx) = (α / x_m)(cx / x_m)^{−(α+1)} = c^{−(α+1)} (α / x_m)(x / x_m)^{−(α+1)} ∝ f(x)    (4.3)

Note that both the survival function and the density function are power law functions. The plot of a power law function has an inverse-J shape, while it is a linear function in a log-log plot.
The mean and variance of a random variable following a Pareto distribution are

E(X) = { ∞,  α ≤ 1;  α x_m / (α − 1),  α > 1 }    (4.4)

Var(X) = { ∞,  α ∈ (1, 2];  (x_m / (α − 1))² · α / (α − 2),  α > 2 }    (4.5)

If we have a set of data {x_i : i = 1, ..., n} and fit it to a Pareto distribution, we can use the maximum likelihood method to estimate the exponent α. The estimated value of α is given by

α̂ = n [ Σ_i ln(x_i / x_m) ]^{−1}    (4.6)

The proof of this result is given in Appendix A. In practice, we usually specify a range for x_m and repeat (4.6) to estimate α under each x_m. Generally, we choose the combination of x_m and α with the best fitting quality (smallest K-S statistic). To avoid possibly unrealistic results, we exclude combinations in which x_m is too close to the threshold we set, in spite of a small K-S statistic. Now we turn to the definition of the generalized Pareto distribution.

Definition 4.2.7. (Generalized Pareto distribution) If X is a random variable with a generalized Pareto distribution, then the probability that X is greater than some number x, i.e. the survival function, is given by

F̄(x) = P(X > x) = { (1 + α(x − µ)/σ)^{−1/α},  x ≥ µ;  1,  x < µ }

where µ ∈ ℝ, σ > 0, α ∈ ℝ. Here µ is the location parameter, and the x_m of the Pareto distribution is replaced by σ. If α = 0 and µ = 0, the generalized Pareto distribution reduces to the exponential distribution. If α > 0 and µ = σ/α, the generalized Pareto distribution is equivalent to the Pareto distribution.

4.2.3 Application

We use 1494 earthquake magnitude records for the Netherlands from 1906 to 2014, most of which fall in 1990 or later. Some values are 0; we exclude these, since otherwise we could not use distributions that do not allow non-positive input data.
Initially, the distributions we choose and compare are three right-skewed distributions, the gamma distribution, the lognormal distribution and the log-logistic distribution, together with one symmetric distribution, the normal distribution. The fitting results are shown in Table 4.4 and Figures 4.3 and 4.4. Because the p-values of all the distributions are so low that the null hypothesis is always rejected, we only compare the values of the K-S statistic. In Table 4.4, the log-logistic distribution has the lowest K-S statistic, and its parameter estimates also have the smallest standard errors. Moreover, the right-skewed distributions have better fitting quality than that
Distribution   K-S      a (mu)     Std. Err.   b (sigma)   Std. Err.
gamma          0.0636   3.42517    0.120858    0.406449    0.015446
lognormal      0.0973   0.17783    0.015767    0.603691    0.011155
log-logistic   0.0522   0.221148   0.014294    0.318756    0.007021
normal         0.1082   1.36606    0.019323    0.746875    0.01367

Table 4.4: Fitting Results of Four Distributions

Figure 4.3: ECDF and Fitted CDF

Figure 4.4: Density of Four Fitted Distributions with the Empirical Data
of the symmetric one (the normal distribution). In Figure 4.4, all the right-skewed distributions have heavier tails than the normal. Thus a right-skewed distribution is a better choice for investigating extreme-value probabilities [37]. One should notice, however, that there are fewer data between 0 and 1 than we would expect, since earthquakes of those magnitudes should be the most numerous. One reason is that some minor earthquakes are not easily detected and are therefore not included in the dataset. Weak earthquakes, such as those below magnitude 2, are highly likely to occur every year. If we only look at earthquakes with a magnitude larger than a certain value, such as 2, this defect of the data may be overcome. Therefore we repeat the fitting, with the same distributions, but with a lower limit of 2 on the data.

Distribution   K-S (no limit)   K-S (limited)
gamma          0.0636           0.2
lognormal      0.0973           0.2026
log-logistic   0.0522           0.1924
normal         0.1082           0.2025

Table 4.5: Comparison of K-S Statistic Values, Original vs Limited

Figure 4.5: ECDF and Fitted CDF with Limit

It is clear that the K-S statistics are much larger than before. As described at the beginning of Section 4.2.2, the Pareto distribution can handle data bounded to the left. To verify our intuition that the Pareto distribution fits this data better, we use maximum likelihood estimation to obtain the optimized parameters. Figure 4.6 shows that in a log-log scale plot, the data larger than 2 are approximately linearly distributed. Choosing different values of x_m, we obtain different α, and the corresponding K-S statistics are computed. The parameter combination with the lowest K-S statistic is α = 6.511, x_m = 2.4, with a K-S statistic of 0.1161, much lower than the values in the last column of Table 4.5. This confirms our intuition.
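The comparison of Table 4.4 can be reproduced in outline with SciPy (a sketch on synthetic magnitudes, since the KNMI dataset is not included here; `fisk` is SciPy's name for the log-logistic distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
magnitudes = rng.lognormal(mean=0.18, sigma=0.6, size=1494)  # stand-in data

candidates = {
    'gamma':        stats.gamma,
    'lognormal':    stats.lognorm,
    'log-logistic': stats.fisk,
    'normal':       stats.norm,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(magnitudes)                        # maximum likelihood fit
    results[name], _ = stats.kstest(magnitudes, dist(*params).cdf)

for name, d_n in sorted(results.items(), key=lambda kv: kv[1]):
    print(f'{name:13s} K-S = {d_n:.4f}')                 # smallest K-S fits best
```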
alpha    x_m   K-S
5.3089   2     0.1413
5.4919   2.1   0.1193
5.9387   2.2   0.1422
6.184    2.3   0.1333
6.511    2.4   0.1161
7.0247   2.5   0.1397
7.4358   2.6   0.1552
7.7533   2.7   0.1856
7.7669   2.8   0.1807
8.1657   2.9   0.1702
9.9049   3     0.2381

Table 4.6: Parameter Combinations of the Pareto Distribution with Different K-S Statistic Values

Figure 4.6: Log-log Scale Plot and Fitted Distributions
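The search over x_m above can be reproduced in a few lines. The sketch below is a minimal illustration, not the thesis's implementation: it uses synthetic magnitudes in place of the earthquake catalogue (which is not reproduced here), computes the closed-form Pareto MLE α̂ = n / Σ log(x_i/x_m) for the tail above x_m, and evaluates the K-S distance between the empirical and fitted CDFs:

```python
import math
import random

def pareto_alpha_mle(tail, xm):
    """Closed-form Pareto shape MLE: alpha = n / sum(log(x_i / x_m))."""
    return len(tail) / sum(math.log(x / xm) for x in tail)

def ks_distance(tail, xm, alpha):
    """Kolmogorov-Smirnov distance between the empirical CDF and the fitted Pareto CDF."""
    xs = sorted(tail)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - (xm / x) ** alpha            # Pareto CDF: F(x) = 1 - (x_m / x)^alpha
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d

# synthetic magnitudes above x_m = 2.4, drawn by inverse-CDF sampling
random.seed(0)
xm, true_alpha = 2.4, 6.511
tail = [xm * (1.0 - random.random()) ** (-1.0 / true_alpha) for _ in range(5000)]

alpha_hat = pareto_alpha_mle(tail, xm)
ks = ks_distance(tail, xm, alpha_hat)
```

On real data one would repeat this over a grid of x_m values, as in Table 4.6, and keep the combination with the smallest K-S distance.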
Let us look at the cumulative probability specifically. By the definition of the Pareto distribution, we can calculate the probability in every interval. Taking the optimized parameters, we have the results in Table 4.7.

              P(2.4 < X < 5)   P(X > 5)   P(5 < X < 6)   P(X > 6)
Probability   0.9824886        0.017511   0.0111         0.006411
Years                          57.10571   90.09019       155.9723

Table 4.7: Probability of Earthquake in Different Magnitude Intervals

One can read from Table 4.7 that earthquakes with a magnitude larger than 5 are estimated to happen once in 57 years, while those with a magnitude larger than 6 once in 156 years. More results for other risk probabilities are listed in Appendix B.

4.3 Parameter Uncertainty

In the previous section, we estimated the parameters of the Pareto distribution, which was selected as the fitting distribution for the data of interest. However, the integrity of the data is likely to influence our estimation. To evaluate the accuracy of the parameters, we introduce bootstrap methods to build confidence intervals, including the basic bootstrap, the percentile bootstrap, and the bias-corrected and accelerated (BC_a) bootstrap [38].

4.3.1 Bootstrap Confidence Interval

We use bootstrap methods to build a confidence interval for the α of Section 4.2.3 [38]. Consider α̂ = α̂(X_1, ..., X_n), an estimator of α(F) depending on a sample X_1, ..., X_n with common distribution function F, which is unknown. The basic principle of the bootstrap is to first use the observations X_1, ..., X_n to estimate the underlying unknown distribution function F by an estimator, say F_n. Then we estimate the distribution of √n(α̂(X_1, ..., X_n) − α(F)) by that of √n(α̂(X*_1, ..., X*_n) − α(F_n)), where X*_1, ..., X*_n, given X_1, ..., X_n, is a sample of size n from the distribution F_n. We denote α̂* = α̂(X*_1, ..., X*_n).
Hence we estimate the distribution function G_n by the random distribution function G*_n, where

G_n(x) = P( √n(α̂(X_1, ..., X_n) − α(F)) ≤ x )   (4.7)

G*_n(x) = P( √n(α̂(X*_1, ..., X*_n) − α(F_n)) ≤ x | X_1, ..., X_n )   (4.8)

The estimator α(F_n) is called the bootstrap estimator, and G*_n is called the bootstrap approximation of G_n. We say that the bootstrap is asymptotically consistent, or that the bootstrap works, if

P( lim_{n→∞} sup_x |G_n(x) − G*_n(x)| = 0 ) = 1   (4.9)

Now we turn to how to build a basic bootstrap confidence interval. Suppose that (4.7), (4.8) and (4.9) hold. We can approximate the unknown percentiles of G_n by the known percentiles of G*_n. Define

c_{n,θ/2} = G_n^{−1}(θ/2),   c_{n,1−θ/2} = G_n^{−1}(1 − θ/2)   (4.10)
and

c*_{n,θ/2} = (G*_n)^{−1}(θ/2),   c*_{n,1−θ/2} = (G*_n)^{−1}(1 − θ/2)   (4.11)

We have c*_{n,θ/2} ≈ c_{n,θ/2} and c*_{n,1−θ/2} ≈ c_{n,1−θ/2}. This implies

P( c*_{n,θ/2} ≤ √n(α̂ − α) ≤ c*_{n,1−θ/2} ) ≈ G_n(c*_{n,1−θ/2}) − G_n(c*_{n,θ/2}) ≈ (1 − θ/2) − θ/2 = 1 − θ   (4.12)

Since

c*_{n,θ/2} ≤ √n(α̂ − α) ≤ c*_{n,1−θ/2}   (4.13)

we have the confidence interval

α ∈ ( α̂ − c*_{n,1−θ/2}/√n,  α̂ − c*_{n,θ/2}/√n )   (4.14)

Writing the bootstrap approximation G*_n(x) = P( √n(α̂* − α̂) ≤ x | X_1, ..., X_n ), and b*_{n,p} for the p-th percentile of the distribution of α̂*, then for p ∈ (0, 1) we have

√n( b*_{n,p} − α̂ ) = c*_{n,p}   (4.15)

From (4.15) it now follows that (4.14) is equal to

( 2α̂ − b*_{n,1−θ/2},  2α̂ − b*_{n,θ/2} )   (4.16)

This is the basic bootstrap confidence interval. By the Central Limit Theorem [27], we know that √n(α̂(X_1, ..., X_n) − α) is asymptotically normally distributed as n → ∞. By the symmetry of the limit distribution, let us define

d_{n,1−θ/2} = −c*_{n,θ/2},   d_{n,θ/2} = −c*_{n,1−θ/2}   (4.17)

so that d_{n,p} ≈ c*_{n,p}, and, by (4.15),

√n( b*_{n,p} − α̂ ) ≈ d_{n,p}   (4.18)

From (4.17) and (4.18) it now follows that (4.16) is approximately equal to

( b*_{n,θ/2},  b*_{n,1−θ/2} )   (4.19)

This gives the percentile bootstrap confidence interval.

In practice, b*_{n,θ/2} and b*_{n,1−θ/2} can be obtained from α̂*_1, ..., α̂*_m, which are m bootstrap replications of α̂. They are the (θ/2)m-th and (1 − θ/2)m-th order statistics of α̂*_1, ..., α̂*_m. If the bootstrap distribution of an estimator is symmetric, the percentile confidence interval is often used. However, if the bootstrap distribution is non-symmetric, percentile confidence intervals are often inappropriate: bias in the bootstrap distribution will lead to bias in the confidence interval. Therefore, we have to find a better confidence interval in this situation.
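The basic interval (2α̂ − b*_{1−θ/2}, 2α̂ − b*_{θ/2}) and the percentile interval (b*_{θ/2}, b*_{1−θ/2}) can be sketched in a few lines. The snippet below is an illustrative implementation, not the thesis's code, and again uses synthetic Pareto data in place of the earthquake catalogue:

```python
import math
import random

def alpha_mle(xs, xm):
    """Pareto shape MLE for a sample bounded below by x_m."""
    return len(xs) / sum(math.log(x / xm) for x in xs)

def bootstrap_intervals(data, xm, m=2000, theta=0.05, seed=1):
    """Basic and percentile bootstrap confidence intervals for alpha."""
    rng = random.Random(seed)
    a_hat = alpha_mle(data, xm)
    # m bootstrap replications: resample the data with replacement each time
    reps = sorted(
        alpha_mle([rng.choice(data) for _ in data], xm) for _ in range(m)
    )
    b_lo = reps[int(theta / 2 * m)]                # b*_{theta/2}
    b_hi = reps[int((1 - theta / 2) * m) - 1]      # b*_{1 - theta/2}
    return {
        "basic": (2 * a_hat - b_hi, 2 * a_hat - b_lo),
        "percentile": (b_lo, b_hi),
    }

random.seed(2)
xm, true_alpha = 2.4, 6.511
data = [xm * (1.0 - random.random()) ** (-1.0 / true_alpha) for _ in range(800)]
cis = bootstrap_intervals(data, xm)
```

When the bootstrap distribution is close to symmetric, the two intervals nearly coincide, which is exactly the argument behind (4.19).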
In order to adjust for bias and skewness in the bootstrap distribution, we introduce the bias-corrected and accelerated bootstrap method (BC_a) to compute the confidence interval [39]. Let us consider an estimator α̂ ~ f_α, where f_α is a one-parameter family of densities for the real-valued statistic α̂. By definition, the parametric bootstrap distribution in this situation has

α̂* ~ f_α̂   (4.20)

We also define the CDF of the bootstrap distribution

G(s) = ∫_{−∞}^{s} f_α̂(t) dt = P(α̂* ≤ s)   (4.21)

Suppose there exists a monotone increasing transformation g and constants z_0 and a such that

φ̂ = g(α̂),   φ = g(α)   (4.22)

satisfy

φ̂ = φ + σ_φ (Z − z_0),   Z ~ N(0, 1)   (4.23)

with

σ_φ = 1 + aφ   (4.24)

Here z_0 is called the bias constant, a the acceleration constant, and σ_φ is the standard deviation of φ̂. We assume that φ > −1/a if a > 0 in (4.24), so that σ_φ > 0, and likewise φ < −1/a if a < 0. The constant |a| will typically be in the range |a| ≤ 0.2, as will |z_0|. Let Φ denote the standard normal CDF, and let G^{−1}(θ) denote the θ-th quantile of the bootstrap CDF (4.21). We obtain the confidence interval in the following lemma.

Lemma 4.3.1. (BC_a bootstrap confidence interval) Under conditions (4.22)-(4.24), the BC_a bootstrap confidence interval of level 1 − 2θ for α is

( G^{−1}(Φ(z[θ])),  G^{−1}(Φ(z[1 − θ])) )

where

z[θ] = z_0 + (z_0 + z^{(θ)}) / (1 − a(z_0 + z^{(θ)}))

and likewise for z[1 − θ]. Here z^{(θ)} is the θ-th percentile point of the standard normal distribution.

Proof. See [36].

If z_0 and a equal 0, then z[θ] = z^{(θ)}, and the confidence interval becomes (G^{−1}(θ), G^{−1}(1 − θ)), a percentile confidence interval. In general z_0 and a are not equal to zero; thus BC_a is an adjustment to the percentile method.
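Lemma 4.3.1 translates directly into code. The nonparametric sketch below (stdlib only, synthetic data again; an illustration rather than the thesis's implementation) estimates z_0 from the fraction of bootstrap replications below α̂, estimates the acceleration a from jackknife (leave-one-out) values, and reads the adjusted endpoints off the ordered replications:

```python
import math
import random
from statistics import NormalDist

def alpha_mle(xs, xm):
    return len(xs) / sum(math.log(x / xm) for x in xs)

def bca_interval(data, xm, m=2000, theta=0.025, seed=3):
    """BCa interval for the Pareto shape, following Lemma 4.3.1."""
    rng = random.Random(seed)
    n = len(data)
    nd = NormalDist()
    theta_hat = alpha_mle(data, xm)
    reps = sorted(
        alpha_mle([rng.choice(data) for _ in range(n)], xm) for _ in range(m)
    )
    # bias constant z0: how far the bootstrap distribution sits from theta_hat
    z0 = nd.inv_cdf(sum(r < theta_hat for r in reps) / m)
    # acceleration constant a from the jackknife values
    jack = [alpha_mle(data[:i] + data[i + 1:], xm) for i in range(n)]
    jbar = sum(jack) / n
    a = sum((jbar - j) ** 3 for j in jack) / (
        6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    )
    def endpoint(q):
        zq = nd.inv_cdf(q)
        adj = nd.cdf(z0 + (z0 + zq) / (1.0 - a * (z0 + zq)))  # z[theta] of the lemma
        return reps[min(m - 1, int(adj * m))]
    return endpoint(theta), endpoint(1.0 - theta)

random.seed(2)
xm, true_alpha = 2.4, 6.511
data = [xm * (1.0 - random.random()) ** (-1.0 / true_alpha) for _ in range(200)]
lo, hi = bca_interval(data, xm)
```

With z_0 ≈ 0 and a ≈ 0 the endpoints reduce to the plain percentile interval, as noted after the lemma.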
Figure 4.7: Bootstrap Replications of α

Figure 4.8: QQ-Plot of α and log(α)
4.3.2 Application

We use the data from Section 4.2. The estimated parameter α equals 6.511, under the condition x_m = 2.4. Using the bootstrap method, we obtain 10000 replications of α and compute the corresponding 95% confidence intervals. The results are shown in Figure 4.7 and Figure 4.8. The two figures illustrate the bootstrap replications of α and their symmetry. If the distribution of α were normal, the QQ-plot would be close to linear. We can read from the plot that intermediate values of α are approximately normal, while some large values deviate to some extent. Therefore the distribution of α is likely to be slightly asymmetric and skewed.

Bootstrap method   Left limit   Right limit
Basic              6.431806     6.745797
Percentile         6.276203     6.590194
BC_a               6.4369       6.6529

Table 4.8: 95% Confidence Intervals of α

Table 4.8 shows that the basic bootstrap method overestimates the upper limit of the confidence interval, while the percentile method underestimates the corresponding values. Using the limit values of BC_a to compute the probabilities, we obtain Table 4.9: earthquakes with a magnitude larger than 5 are likely to happen between once in 54 and once in 63 years, while those with a magnitude larger than 6 between once in 146 and once in 178 years.

Lower bound   P(2.4 < X < 5)   P(X > 5)   P(5 < X < 6)   P(X > 6)
Probability   0.98151          0.01849    0.011628       0.006862
Years                          54.08284   85.99693       145.7337

Upper bound   P(2.4 < X < 5)   P(X > 5)   P(5 < X < 6)   P(X > 6)
Probability   0.984221         0.015779   0.01015        0.00563
Years                          63.37404   98.52571       177.6295

Table 4.9: Lower and upper bounds for the 1-in-X-years earthquake

4.4 Parameter Learning

Distribution fitting gives us a decent method to calculate the probability of risks given data. In a Bayesian network, we can implement parameter estimation directly, so that the probability of risks can be updated by data combined with prior information.
In this section, we will see how to learn parameters in GeNIe, given the original prior distributions.

4.4.1 Dirichlet Distribution

Before explaining Bayesian parameter estimation, we first introduce the Dirichlet distribution, which is usually used as the prior distribution for probability vectors [40]. The advantage
of a Dirichlet distribution is that if the data have a multinomial distribution, the posterior distribution is also Dirichlet, so that information can be updated in closed form using sufficient statistics from the data.

Definition 4.4.1. (Dirichlet Distribution) The Dirichlet distribution of order k ≥ 2 with parameters α_1, ..., α_k > 0 has a probability density function on R^{k−1} given by

f(x_1, ..., x_{k−1}; α_1, ..., α_k) = (1 / B(α)) ∏_{i=1}^{k} x_i^{α_i − 1}

where x_1, ..., x_{k−1} > 0, x_1 + ... + x_{k−1} < 1, and x_k = 1 − x_1 − ... − x_{k−1}. The normalizing constant is the multinomial beta function, which can be expressed in terms of the gamma function:

B(α) = ∏_{i=1}^{k} Γ(α_i) / Γ( ∑_{i=1}^{k} α_i ),   Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx

where α = (α_1, ..., α_k) are called hyperparameters.

Suppose a node in a Bayesian network is a random variable that has multiple states. We first assume that a parameter vector θ = (θ_1, ..., θ_k), representing the probability distribution of the states in each node, is Dirichlet distributed. The prior probability is then given by P(θ). If we have a dataset X = {X_1, ..., X_k}, by Bayes' theorem the posterior probability is

P(θ | X) = P(θ) P(X | θ) / P(X)   (4.25)

where P(X | θ) is the likelihood function and P(X) the marginal distribution of the data. We have θ ~ Dirichlet(α_1, ..., α_k) if P(θ) ∝ ∏_{i=1}^{k} θ_i^{α_i − 1}. Intuitively, the hyperparameters correspond to the number of samples we have seen. We use α = ∑_j α_j, the confidence strength of the prior belief. If α is large, we have more confidence in our prior knowledge. If we use a Dirichlet prior, then the posterior is also Dirichlet.

Proposition 4.4.2. If P(θ) is Dirichlet(α_1, ..., α_k), then P(θ | X) is Dirichlet(α_1 + X_1, ..., α_k + X_k), where X_k is the number of occurrences of x_k.

Proof. See [14].

Priors such as the Dirichlet are useful, since they ensure that the posterior has a nice compact description. Moreover, this description uses the same representation as the prior.
They are called conjugate priors. This phenomenon is a general one, and one that we strive to achieve, since it makes our computation and representation much easier.

4.4.2 Bayesian Estimation

In this section we will see how to predict the values of a random variable from given data with a Dirichlet prior. The expectation of Dirichlet(α_1, ..., α_k) [37] is given by

E(θ_i) = α_i / α   (4.26)
By Proposition 4.4.2, it is easy to obtain the expectation of Dirichlet(α_1 + X_1, ..., α_k + X_k):

E(θ_i | X) = (α_i + X_i) / (α + X)   (4.27)

Here we define X = ∑_j X_j, the sample size. We call the prediction (4.27) the Bayesian estimate or maximum a posteriori (MAP) estimate. The update of the posterior distribution is called learning: we learn the probability from the data combined with some prior knowledge. From (4.27) we see that for a fixed ratio α_i/α, the larger α is, the more slowly (4.27) moves away from the prior mean (4.26). If X is very large, say X → ∞, the Bayesian estimate is close to the MLE. We will see more examples of the interactions among these factors in the next section.

In some cases the dataset is not complete: missing or unobserved values exist. A good algorithm, expectation maximization, can estimate the probabilities based on the network structure and the dataset even if there are missing values. The expectation maximization (EM) algorithm is an iterative method for finding maximum likelihood (ML) or maximum a posteriori (MAP) estimates of parameters in statistical models, and it is also applicable to Bayesian networks [41]. The EM iteration alternates between an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step.

Algorithm 4.4.3.
(EM algorithm) Given a set X of observed data, a set Z of unobserved latent data or missing values, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = P(X, Z | θ), the MLE of the unknown parameters is determined by the marginal likelihood of the observed data

L(θ; X) = P(X | θ) = ∑_Z P(X, Z | θ)

Since the number of terms in this sum typically grows exponentially, the EM algorithm instead seeks the MLE of the marginal likelihood by iteratively applying the following two steps:

Expectation step (E step): Calculate the expected value of the log-likelihood function with respect to the conditional distribution of Z given X under the current estimate of the parameters θ^(t):

Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log L(θ; X, Z) ]

Maximization step (M step): Find the parameters that maximize this quantity:

θ^(t+1) = argmax_θ Q(θ | θ^(t))

A proof of correctness of the EM algorithm can be found in [42].

If the nodes in a Bayesian network are discrete (have a countable number of states) and the data points are continuous (taking values in an uncountably infinite set), one should discretize the dataset into intervals matching the thresholds we set. Although discretization of continuous variables may result in some loss of statistical accuracy, it is still a convenient way to support parameter estimation.
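As an illustration of the two steps, the sketch below runs EM on a classic toy problem that is not from the thesis: two coins with unknown heads probabilities, where each trial flips one coin, chosen at random, ten times, and only the heads count is observed. The coin identity plays the role of the latent variable Z:

```python
def em_two_coins(heads, n_flips, theta_a, theta_b, iters=50):
    """EM for a mixture of two biased coins with equal mixing weights.

    E step: responsibility of coin A for each trial under the current thetas.
    M step: re-estimate each theta as a responsibility-weighted heads fraction.
    """
    for _ in range(iters):
        a_heads = a_flips = b_heads = b_flips = 0.0
        for h in heads:
            like_a = theta_a ** h * (1.0 - theta_a) ** (n_flips - h)
            like_b = theta_b ** h * (1.0 - theta_b) ** (n_flips - h)
            resp_a = like_a / (like_a + like_b)        # E step: P(coin A | h)
            a_heads += resp_a * h
            a_flips += resp_a * n_flips
            b_heads += (1.0 - resp_a) * h
            b_flips += (1.0 - resp_a) * n_flips
        theta_a = a_heads / a_flips                     # M step
        theta_b = b_heads / b_flips
    return theta_a, theta_b

# heads out of 10 flips per trial; which coin was flipped is unobserved
observed = [9, 8, 9, 1, 2, 8, 9, 1, 2, 9]
ta, tb = em_two_coins(observed, 10, theta_a=0.6, theta_b=0.4)
```

The iteration separates the trials into a high-heads and a low-heads cluster and converges to heads probabilities near 0.87 and 0.15, even though no trial is labeled with its coin.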
4.4.3 Application

We use the EM algorithm to learn from the earlier dataset to estimate the Europe earthquake probability in GeNIe. Note that the data we use are assumed to be complete, without missing values. After discretization, 1492 of 1494 values fall into the interval below 5, 2 of 1494 fall into 5 to 6, and none fall above 6. The prior distribution we set is Dirichlet, with varying confidence strength α. Note that the other nodes should be fixed; otherwise they would be treated as missing values in EM learning. There are three states to consider: none, limited, severe. Their prior probability ratio is fixed at α_1 : α_2 : α_3 = 95 : 4 : 1, and the data ratio is X_1 : X_2 : X_3 = 1492 : 2 : 0. We choose 7 levels of α and compare their learning results in Table 4.10 and Figure 4.9.

α      α_1        α_2        α_3
1      0.998629   0.001365   6.69E-06
5      0.998499   0.001468   3.34E-05
10     0.998338   0.001596   6.65E-05
50     0.997085   0.002591   0.000324
100    0.995609   0.003764   0.000627
500    0.986459   0.011033   0.002508
1000   0.97915    0.01684    0.00401

Table 4.10: Learning Results for Different α

Figure 4.9: Ratio Fluctuation of (α_i + X_i)/(α + X) During Learning

The prior distributions vary from Dirichlet(0.95, 0.04, 0.01) to Dirichlet(950, 40, 10), so that they have the same starting point before the data input. From Figure 4.9 it is easy to see that the probability is more sensitive to noise in the data when α is small and less sensitive when α is large. As described by (4.27), the smaller α is, the quicker the ratio converges to X_i/X. Table 4.10 tells us that the posterior distribution for α = 1 is much further from the prior than that for α = 1000. This indicates that a distribution with stronger confidence in the prior is less easily influenced by data learning. The posterior distribution is Dirichlet(0.998629, 0.001365, 6.69E-06) if α = 1, and Dirichlet(979.15, 16.84, 4.01) if
α = 1000 (Figure 4.10). Furthermore, we can choose the value of α depending on our needs. For instance, α/X is a useful quotient by which we can decide the degree of confidence we place in the prior relative to the data. In our example with α = 1000, α/X = 1000/1494 ≈ 2/3, which implies that our probability estimate relies for approximately 40% on the prior belief and for 60% on the external data. Both our knowledge of certain risks and the reliability of the data can influence the choice of this ratio.

Figure 4.10: Random Numbers Generated from Dirichlet Distributions

Once we know the posterior distribution of the risk, we can generate random numbers from it and estimate the 95% confidence interval by percentiles. Figure 4.11 illustrates the samples generated from Dirichlet(95, 4, 1); for the state none the 2.5% percentile is 0.9429 and the 97.5% percentile is 0.9972. Thus we are 95% confident that the probability that an earthquake will not exceed magnitude 5 lies between 0.9429 and 0.9972.

Figure 4.11: Surface of Dirichlet(95, 4, 1), 1000 samples
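The computations behind Table 4.10 and the percentile interval above can be reproduced without GeNIe. The sketch below (stdlib only; an illustration, not the thesis's implementation) performs the conjugate update of Proposition 4.4.2 for the row α = 100, i.e. prior Dirichlet(95, 4, 1) plus counts (1492, 2, 0), and then estimates a 95% percentile interval for the none state by sampling, using the fact that normalized independent Gamma variates are Dirichlet distributed:

```python
import random

def dirichlet_posterior(prior, counts):
    """Conjugate update of Proposition 4.4.2: add the observed counts to the prior."""
    return [a + x for a, x in zip(prior, counts)]

def sample_dirichlet(alphas, rng):
    """One Dirichlet draw: normalized independent Gamma(alpha_i, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

prior = [95, 4, 1]               # states none, limited, severe; alpha = 100
counts = [1492, 2, 0]            # discretized earthquake data
post = dirichlet_posterior(prior, counts)
mean_none = post[0] / sum(post)  # posterior mean (4.27); cf. Table 4.10, alpha = 100

rng = random.Random(4)
draws = sorted(sample_dirichlet(post, rng)[0] for _ in range(10_000))
ci = (draws[249], draws[9_749])  # 95% percentile interval for the "none" state
```

The posterior mean reproduces the α = 100 row of Table 4.10 (0.995609 for the none state), and the sampled interval is the same percentile construction used for Figure 4.11.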
Chapter 5

BCRA: Bayesian Network Analysis

A Bayesian network is complete when its structure is determined and the conditional probabilities are filled in. However, one is typically interested in reasoning about a conditional probability given some evidence, and in how the probabilities of nodes interact within the structure. We will introduce several Bayesian inference algorithms to answer the first question, and implement a sensitivity analysis to address the second.

5.1 Bayesian Inference

Bayesian inference is a method of inference in which Bayes' theorem is used to update the probability estimate for a hypothesis as additional evidence is acquired. In a Bayesian network, we are interested in the conditional probability of some nodes while evidence is set on other nodes. Generally there are two kinds of Bayesian inference: causal and evidential reasoning. Taking Figure 3.2 as an example, P(BP2 | F) is a causal reasoning and P(F | BP1) an evidential reasoning. In a more complex Bayesian network, like those in Figure 3.4 and Figure 3.6, we cannot easily calculate the conditional probability using Bayes' theorem directly. Therefore, Bayesian inference algorithms are needed to make a reasonable estimate of certain conditional probabilities.

5.1.1 Inference Algorithms

Bayesian inference algorithms include exact inference and approximate inference [6]. The former are accurate but not applicable to complicated networks. The latter are less accurate but widely used in networks that are large and complex. Even where an exact algorithm is available and accurate, there are networks for which its memory requirements and updating time are not acceptable. In these cases, we may decide to sacrifice some precision and choose an approximate algorithm. Well-known exact inference algorithms are variable elimination (VE) and clique tree propagation (clustering).
The first one eliminates (by integration or summation) the non-observed, non-query variables one by one by distributing the sum over the product. The second algorithm caches the computations, so that many variables can be queried at one time and new evidence can be propagated quickly. Both of these methods have a complexity that is exponential in the network's treewidth [6].
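The idea of distributing the sum over the product can be shown on a toy chain, with hypothetical numbers rather than the thesis's model: a chain Risk → Building → Process, where eliminating the variables one by one replaces the enumeration of all joint states by two small sums:

```python
def p_process_fails(p_risk, p_building_given_risk, p_fail_given_building):
    """Variable elimination on the chain Risk -> Building -> Process."""
    # eliminate Risk: P(Building = down) = sum_r P(down | r) * P(r)
    p_down = (
        p_building_given_risk[True] * p_risk
        + p_building_given_risk[False] * (1.0 - p_risk)
    )
    # eliminate Building: P(Process = fail) = sum_b P(fail | b) * P(b)
    return (
        p_fail_given_building[True] * p_down
        + p_fail_given_building[False] * (1.0 - p_down)
    )

# illustrative CPT entries, not the thesis's values
p_fail = p_process_fails(
    p_risk=0.3,
    p_building_given_risk={True: 0.9, False: 0.2},   # P(Building = down | Risk)
    p_fail_given_building={True: 0.7, False: 0.1},   # P(Process = fail | Building)
)
```

Each elimination step produces an intermediate factor over the remaining variables; on a chain these factors stay small, which is why the overall cost is governed by the treewidth rather than the number of nodes.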
The most common approximate inference algorithms are stochastic sampling algorithms (including EPIS sampling [43], AIS sampling [44], logic sampling [45], backward sampling [46], likelihood sampling [47] and self-importance sampling [48]) and MCMC (Markov chain Monte Carlo) simulation methods (Gibbs sampling, Metropolis sampling, etc.). Sampling and MCMC algorithms are based on Monte Carlo simulation, in which the model is run through individual trials involving deterministic scenarios. The final result is based on the number of times individual scenarios were selected in the simulation. Among these sampling algorithms, EPIS sampling, the Evidence Pre-propagation Importance Sampling algorithm for Bayesian Networks (EPIS-BN), is considered the fastest and most accurate approximate inference algorithm to date. It uses loopy belief propagation to compute an approximation of the optimal importance function, and then applies an ε-cutoff heuristic to remove small probabilities from the importance function [49]. We will compare this algorithm with the other stochastic sampling methods in the next section.

5.1.2 Inference Scenario Tests

We randomly select 5 risk nodes in the Bayesian network of Figure 3.4: Europe earthquake, Power outage, Emergency power, Fire and US storm. All marginal probabilities in the Bayesian network are updated by the stochastic sampling algorithms in GeNIe. We make a comparison with clustering, where the probabilities are calculated exactly (risk probabilities are the same as the prior). The results are listed below in Table 5.1.
                              Clustering   EPIS       AIS        Logic      Backward   Likelihood   Self-importance
Time (s)                                   42.7375    100.244    100.448    99.8827    99.6336      102.808
Europe earthquake   Severe    0.007        0.008      0.002      0.003      0.006      0.007        0
                    Limited   0.011        0.009      0.005      0.009      0.014      0.008        0.000187526
                    Normal    0.982        0.983      0.993      0.988      0.98       0.985        0.999812474
Power outage        Severe    0.01         0.008      0.012      0.009      0.015      0.004        0.000035602
                    Limited   0.04         0.044      0.04       0.046      0.049      0.038        0.018213335
                    Normal    0.95         0.948      0.948      0.945      0.936      0.958        0.981751064
Emergency power     Severe    0.2          0.199      0.199      0.217      0.188      0.191        0.076082664
                    Limited   0.6          0.601      0.598      0.585      0.617      0.627        0.319883477
                    Normal    0.2          0.2        0.203      0.198      0.195      0.182        0.604033859
Fire                Severe    0.0017       0.002      0.001      0.004      0          0            0
                    Limited   0.0233       0.022      0.017      0.029      0.021      0.019        0.006941771
                    Normal    0.975        0.976      0.982      0.967      0.979      0.981        0.993058229
US storm            Severe    0.008        0.013      0.009      0.009      0.007      0.009        0.004873505
                    Limited   0.069        0.061      0.07       0.072      0.056      0.074        0.040786789
                    Normal    0.923        0.926      0.921      0.919      0.937      0.917        0.954339706
Variance                      0            0.00013278 0.00029918 0.00076378 0.00116418 0.00137538   0.261513922

Table 5.1: Sampling Algorithm Efficiency

Clearly, EPIS sampling is the most efficient algorithm, since its calculation speed dominates all the other algorithms, whose times are more or less the same. For the accuracy of the inference, EPIS is also the best choice: its variance is the smallest (taking the clustering results as the true values), followed by AIS, logic, backward, likelihood and self-importance sampling respectively. One can conclude that EPIS should be considered first when we apply Bayesian inference in GeNIe.

How does the conditional probability change when evidence is set? There are 24 business processes in the model, and the probability of a limited power outage in a year is 0.04. If we set evidence on the limited state of power outage, we consider P(Business Process = Fail | Power outage = Limited).
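Evidential queries of this kind can be answered by the sampling schemes of Section 5.1.1. A minimal likelihood-weighting sketch on a toy two-node network Risk → Process (illustrative CPT numbers, not the thesis's model): sample the unobserved node from its prior and weight each sample by the likelihood of the evidence Process = fail:

```python
import random

def likelihood_weighting(n_samples, p_limited, p_fail_given, rng):
    """Estimate P(Risk = limited | Process = fail) on the chain Risk -> Process.

    Sample the unobserved node (Risk) from its prior; weight every sample by
    the likelihood of the evidence, P(Process = fail | sampled Risk).
    """
    num = den = 0.0
    for _ in range(n_samples):
        limited = rng.random() < p_limited       # sample Risk from its prior
        w = p_fail_given[limited]                # evidence weight
        num += w * limited
        den += w
    return num / den

rng = random.Random(5)
est = likelihood_weighting(
    n_samples=200_000,
    p_limited=0.04,                              # prior P(Risk = limited)
    p_fail_given={True: 0.6, False: 0.1},        # illustrative CPT
    rng=rng,
)
# exact posterior by Bayes' theorem: 0.04*0.6 / (0.04*0.6 + 0.96*0.1) = 0.2
```

Because the evidence is never sampled away, every trial contributes, which is why weighted schemes of this family outperform plain logic sampling when the evidence is unlikely.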
In Table 5.2 it is clear that the conditional probability of BP20 failing is about 5 times the original when we have e = Limited on power outage, which indicates that it is influenced most by the power outage. Similarly, for all 19 risk nodes, we set evidence on BP5, a business process that has a
Business Process   P(X = F)   P(X = F | E = e)   P(X = F | E = e) / P(X = F)
BP21               0.026      0.126              4.846154
BP8                0.168      0.278              1.654762
BP5                0.08       0.364              4.55
BP16               0.109      0.318              2.917431
BP3                0.134      0.149              1.11194
BP18               0.135      0.157              1.162963
BP9                0.039      0.127              3.25641
BP2                0.129      0.144              1.116279
BP11               0.02       0.073              3.65
BP6                0.019      0.079              4.157895
BP4                0.157      0.235              1.496815
BP13               0.082      0.338              4.121951
BP20               0.024      0.12               5
BP10               0.039      0.124              3.179487
BP22               0.045      0.155              3.444444
BP23               0.029      0.119              4.103448
BP15               0.086      0.336              3.906977
BP14               0.02       0.081              4.05
BP7                0.061      0.207              3.393443
BP12               0.039      0.107              2.74359
BP1                0.124      0.144              1.16129
BP24               0.029      0.121              4.172414
BP19               0.009      0.02               2.222222
BP11               0.116      0.129              1.112069

Table 5.2: Inference Results: Evidence on Risk

0.091 probability of failing (e = fail on BP5). The prior probabilities of Limited for all risks are given in the second column, and the conditional probabilities given that BP5 has failed in the third column (Table 5.3). The last column shows the change from prior to conditional probability. Evidence on BP5 has increased the probabilities of some risks, but it has also reduced some others at the same time. Europe earthquake turns out to be more sensitive than any other risk, while power outage is the risk most likely to cause the failure of BP5. We can also set evidence on more than one node (more observations), which gives us a better estimate of the scenarios we are interested in. However, the inference algorithms may give slightly different results each time in the same scenario. In this case, it is better to execute them multiple times and take the average. The value we acquire converges to a value close to the true one, which gives a reliable result.

5.2 Sensitivity Analysis

Sensitivity analysis is a general technique for studying the effects of inaccuracies in the parameters of a mathematical model.
The analysis aims at systematically varying the values of one or more parameters and recording the output for each value combination. Different types of sensitivity analysis are distinguished, depending on the number of parameters varied in the analysis. The most common type performed in practice is a one-way sensitivity analysis, in which a single parameter is varied. Higher-order sensitivity analyses are hardly ever conducted in practice, not just because of their impractically high runtime,
Risk                P(X = F)   P(X = F | E = e)   P(X = F | E = e) / P(X = F)
Power outage        0.04       0.17968756         4.492189
Fire                0.0233     0.1542643          6.620785408
Pandemic            0.032      0.063639993        1.988749781
Terrorism           0.007      0.047119888        6.731412571
Europe earthquake   0.011      0.097942934        8.903903091
Europe flood        0.008      0.057699332        7.2124165
Europe storm        0.02       0.1102842          5.51421
US earthquake       0.01       0.027126457        2.7126457
US flood            0.018      0.015517554        0.862086333
US storm            0.069      0.067282889        0.975114333
Asia earthquake     0.007      0.005              0.714285714
Asia flood          0.007      0.005              0.714285714
Asia storm          0.07       0.068              0.971428571
India earthquake    0.027      0.053936434        1.997645704
India flood         0.04       0.032              0.8
India storm         0.07       0.058295478        0.832792543
Africa earthquake   0.115      0.092441263        0.80383707
Africa flood        0.015      0.015224311        1.014954067
Africa storm        0.07       0.057506341        0.821519157

Table 5.3: Inference Results: Evidence on Business Process

but also because of the complex interactions among parameters [7]. When applied to a Bayesian network, sensitivity analysis entails studying the effects of varying the network's parameter probabilities on the computed output of interest. A one-way analysis amounts to establishing, for a specific output probability of interest, the function that expresses this probability in terms of the parameter being varied [50]. We will see what this relationship looks like and what it indicates later on.

5.2.1 Functional Relationship

Let G be a BN structure and P its joint probability. We define a node of interest X_r ∈ X(G), where X(G) is the set of nodes in G, and O(G) is the set of observed nodes. Combining the two sets, we obtain a sensitivity set S(X_r, O), the set of nodes that are of interest or observed. The one-way sensitivity analysis of a Bayesian network can then be restricted to the conditional probabilities of the nodes in S(X_r, O).

Proposition 5.2.1. (One-way Sensitivity Analysis) Let G be a BN structure and P its joint probability.
Consider O(G) ⊆ X(G), the set of observed nodes in G, and let o denote the corresponding observation. Let X_r ∈ X(G) be the network's node of interest and let S(X_r, O) be the sensitivity set for X_r given O. Then, for any value x_r of X_r, the sensitivity function is given by

P(x_r | o) = f_{P(x_r|o)}(x) = (ax + b) / (cx + d)

or in another form

P(x_r | o) = f_{P(x_r|o)}(x) = (ax + b) / (x + c)

where x ∈ [0, 1] and f(x) ∈ [0, 1], for every conditional probability x = P(x_s | Pa_X(x_s)) of every node X_s ∈ S(X_r, O), where a, b, c, d are constants depending on x_s and Pa_X(x_s). In particular, if
σ(X_s) ∩ O = ∅, i.e. the nodes of interest are not in the set of observed nodes, then we have

P(x_r | o) = f_{P(x_r|o)}(x) = ax + b

where a, b are constants depending on x_s and Pa_X(x_s).

Proof. By Bayes' theorem, the probability of interest equals

P(x_r | o) = P(x_r, o) / P(o)

By the properties of marginalization and factorization,

P(x_r, o) = ∑_{x_s ∈ X(G)\{x_r, o}} ∏_i P(X_i | Pa_X(X_i))

and

P(o) = ∑_{x_s ∈ X(G)\{o}} ∏_i P(X_i | Pa_X(X_i))

Therefore, both the numerator and the denominator can be written as a sum of products of conditional probabilities. By separating, in these sums, the terms that contain the conditional probability x under study from those that do not, it is readily seen that P(x_r, o) as well as P(o) relate linearly to x.

Given the sensitivity function in Proposition 5.2.1, it is easy to obtain the sensitivity value for a given x as the absolute value of the first derivative of the sensitivity function, |f'_{P(x_r|o)}(x)|. It identifies the proportional relationship between P(x_r | o) and x. The initial assessment of the sensitivity value is |f'_{P(x_r|o)}(x_0)|, where x_0 is the original input value of x. In hyperbola form, the sensitivity function takes the form of part of a hyperbola branch, and the sensitivity value is expressed as

|f'_{P(x_r|o)}(x)| = |bc − ad| / (cx + d)²   (5.1)

or

|f'_{P(x_r|o)}(x)| = |ac − b| / (x + c)²   (5.2)

If |f'_{P(x_r|o)}(x_0)| > 0, then the output probability is sensitive to deviations of x from x_0. Moreover, the larger the sensitivity value, the stronger the effect of deviations from x_0 can be. The vertex (x_v, f_{P(x_r|o)}(x_v)) of the hyperbola is the threshold that separates the sensitivity function into two parts: one part where the sensitivity values are stable, and another where they change dramatically.
It is attained when |f'_{P(x_r|o)}(x)| = 1, where

x_v = ( −d ± √|ad − bc| ) / c   (5.3)

or

x_v = −c ± √|ac − b|   (5.4)

If the assessment x_0 of x is close to x_v, the sensitivity value at x_0 is not a good approximation of the effect of parameter variation. The further x_0 lies from x_v, the better the sensitivity value describes the effect of deviations from x_0. In the linear form, the sensitivity value is a constant, equal to |a|. Sensitivity functions can only be understood from domain knowledge. We can only explain
such interactions in a mathematical form, but specific knowledge is required for further investigation. We illustrate the sensitivity function stated in Proposition 5.2.1 by a simple example. Consider the sub-network Business Process 1 ← Building 1 → Business Process 2 in Figure 3.2. The node of interest is Business Process 1, and we assume that the value Fail has been observed for the node Business Process 2. We consider the probability of interest P(normal_BP1 | fail_BP2) and the conditional probability x = P(fail_BP2 | normal_B1). Then we have

P(normal_BP1 | fail_BP2) = P(normal_BP1, fail_BP2) / P(fail_BP2)   (5.5)

The numerator in this fraction equals

P(normal_BP1, fail_BP2)
= ∑_{B1} P(normal_BP1 | B1) · P(fail_BP2 | B1) · P(B1)
= P(normal_BP1 | normal_B1) · P(fail_BP2 | normal_B1) · P(normal_B1)
  + P(normal_BP1 | fail_B1) · P(fail_BP2 | fail_B1) · P(fail_B1)
= ax + b   (5.6)

where

a = P(normal_BP1 | normal_B1) · P(normal_B1),
b = P(normal_BP1 | fail_B1) · P(fail_BP2 | fail_B1) · P(fail_B1)

The denominator equals

P(fail_BP2)
= ∑_{B1, BP1} P(BP1 | B1) · P(fail_BP2 | B1) · P(B1)
= ∑_{B1} P(fail_BP2 | B1) · P(B1)
= P(fail_BP2 | normal_B1) · P(normal_B1) + P(fail_BP2 | fail_B1) · P(fail_B1)
= cx + d   (5.7)

where

c = P(normal_B1),
d = P(fail_BP2 | fail_B1) · P(fail_B1)

If instead we let x = P(normal_BP1 | normal_B1), we immediately have

a = P(fail_BP2 | normal_B1) · P(normal_B1),
b = P(normal_BP1 | fail_B1) · P(fail_BP2 | fail_B1) · P(fail_B1)

and

P(fail_BP2)
= ∑_{B1, BP1} P(BP1 | B1) · P(fail_BP2 | B1) · P(B1)
= ∑_{B1} P(fail_BP2 | B1) · P(B1)
= P(fail_BP2 | normal_B1) · P(normal_B1) + P(fail_BP2 | fail_B1) · P(fail_B1)
= c   (5.8)
showing that P(fail BP2) does not depend on x. We conclude that in this case the probability of interest relates linearly to the conditional probability under study.
If none of the nodes in a Bayesian network is observed, we have the sensitivity set S(X_r, ∅). Proposition 5.2.1 then indicates that the network's probability of interest relates linearly to every conditional probability of every node [51]. In a complex network it is troublesome to express the values of the constants in terms of conditional probabilities. However, the constants can be determined more efficiently by solving linear equations: the three constants a, b and c can be obtained once we know three pairs of values of the conditional probability under study and the resulting probability of interest. We will see an application in the next section.

5.2.2 Application

We study sensitivity analysis under the model of Figure 3.6 to see whether the properties described above hold for our Bayesian network. Five process nodes are chosen as nodes of interest, along with the risk node Hack. The original input probability of Hack is {0.5, 0.5}. In order to see how a hack attack influences the security of these processes, we compute the conditional probabilities of the processes given evidence while changing the probability of a hack attack by certain percentages. Table 5.4 shows the results of the analysis.
Hack Prob | %Prob Change | P121 (change) | P124 (change) | P232 (change) | P236 (change) | P234 (change)
0.00 | -100% | 0.168786 (-0.41169) | 0.152071 (-0.40642) | 0.140606 (-0.40233) | 0.120665 (-0.39466) | 0.083264 (-0.38)
0.10 | -80% | 0.192408 (-0.32935) | 0.172895 (-0.32514) | 0.159536 (-0.32187) | 0.136399 (-0.31573) | 0.093471 (-0.304)
0.25 | -50% | 0.227842 (-0.20584) | 0.204132 (-0.20321) | 0.187932 (-0.20117) | 0.159999 (-0.19733) | 0.108781 (-0.19)
0.35 | -30% | 0.251465 (-0.12351) | 0.224957 (-0.12193) | 0.206863 (-0.1207) | 0.175733 (-0.1184) | 0.118987 (-0.114)
0.40 | -20% | 0.263276 (-0.08234) | 0.235369 (-0.08128) | 0.216328 (-0.08047) | 0.1836 (-0.07893) | 0.124091 (-0.076)
0.45 | -10% | 0.275087 (-0.04117) | 0.245782 (-0.04064) | 0.225793 (-0.04023) | 0.191466 (-0.03947) | 0.129194 (-0.038)
0.50 | 0% | 0.2869 (0) | 0.25619 (0) | 0.23526 (0) | 0.19933 (0) | 0.1343 (0)
0.55 | +10% | 0.29871 (0.041169) | 0.266606 (0.040642) | 0.244723 (0.040233) | 0.2072 (0.039466) | 0.139401 (0.038)
0.60 | +20% | 0.310521 (0.082338) | 0.277018 (0.081285) | 0.254189 (0.080467) | 0.215067 (0.078931) | 0.144504 (0.076)
0.65 | +30% | 0.322332 (0.123506) | 0.287431 (0.121927) | 0.263654 (0.1207) | 0.222934 (0.118397) | 0.149607 (0.114001)
0.75 | +50% | 0.345955 (0.205844) | 0.308255 (0.203211) | 0.282584 (0.201166) | 0.238667 (0.197328) | 0.159814 (0.190001)
0.90 | +80% | 0.381389 (0.32935) | 0.339492 (0.325138) | 0.31098 (0.321866) | 0.262268 (0.315725) | 0.175124 (0.304002)
1.00 | +100% | 0.405011 (0.411688) | 0.360317 (0.406423) | 0.32991 (0.402333) | 0.278001 (0.394657) | 0.185331 (0.380002)
Slope (a) | | 0.118113 | 0.104123 | 0.094652 | 0.078668 | 0.051033
Intercept (b) | | 0.168786 | 0.152071 | 0.140606 | 0.120665 | 0.083264

Table 5.4: Sensitivity between Hack and Processes (each process column lists the conditional probability, with its change from the 0.50 baseline in parentheses)

The probability of interest is P(Process | Hack), with x = P(Hack). Here no observation is given, so the two probabilities are linearly related. The most sensitive probability is P(P121 | Hack), with a = 0.118113 and b = 0.168786. All five sensitivity function curves are displayed in Figure 5.1. Similarly, the relationship between Power outage and the processes is also linear. However, the slopes in Table 5.5 are negative, while those in Table 5.4 are positive: the processes are positively linearly related to Hack but negatively linearly related to Power outage.
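Either table's columns can be checked for the claimed linearity directly. The sketch below does this for the P121 column of Table 5.4, using only values transcribed from the table. Note that the slope with respect to P(Hack) itself comes out at about 0.2362, i.e. twice the tabulated Slope(a), which is apparently expressed per 100% of relative change from the 0.5 baseline; this is an illustrative check, not part of the thesis tooling.

```python
# (P(Hack), P(P121 | Hack)) pairs transcribed from Table 5.4.
points = [
    (0.00, 0.168786), (0.10, 0.192408), (0.25, 0.227842),
    (0.50, 0.2869),   (0.75, 0.345955), (1.00, 0.405011),
]

# Affine model fixed by the two endpoints: f(x) = slope * x + intercept.
(x0, y0), (x1, y1) = points[0], points[-1]
slope = (y1 - y0) / (x1 - x0)
intercept = y0

# Maximum deviation of all tabulated points from this line.
max_dev = max(abs(slope * x + intercept - y) for x, y in points)
print(round(slope, 4), round(intercept, 6), max_dev < 1e-3)  # 0.2362 0.168786 True
```

The deviations are on the order of 1e-6, confirming that the tabulated probabilities are indeed an affine function of P(Hack).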
The absolute values of the slopes in Table 5.5 are larger than the corresponding entries in Table 5.4, showing that the process probabilities are more sensitive to probability changes in Power outage than in Hack. We now turn to the interaction between the conditional probability of one process given an observation on another process and an arbitrary conditional probability in the model. Suppose we observe that P124 is Fail, while P232 is unobserved. The probability of interest is f(x) = P(fail P232 | fail P124). As described in Proposition 5.2.1, there are two scenarios we
Figure 5.1: P(Process | Hack) vs P(Hack)

Power Prob | %Prob Change | P121 (change) | P124 (change) | P232 (change) | P236 (change) | P234 (change)
0.9405 | -1% | 0.2888 (0.006628) | 0.2579 (0.00666) | 0.2368 (0.006554) | 0.2007 (0.006857) | 0.1352 (0.006722)
0.95 | 0% | 0.2869 (0) | 0.25619 (0) | 0.23526 (0) | 0.19933 (0) | 0.1343 (0)
0.9595 | +1% | 0.285 (-0.00662) | 0.2545 (-0.00661) | 0.2337 (-0.00662) | 0.198 (-0.00669) | 0.1334 (-0.00668)
Slope | | -0.19015 | -0.17061 | -0.15418 | -0.13668 | -0.09027

Table 5.5: Sensitivity between Power Outage and Processes

consider:
- the parameter under study belongs to an observed node; we then set x = P(fail P124 | yes Hack);
- the parameter under study belongs to an unobserved node; we then set x = P(fail P232 | yes Hack).

For the first scenario we compute, in GeNIe, three pairs of values:

(x, f(x)) = (0.36, 0.823), (0.29, 0.829), (0.15, 0.837)

From these values we obtain the three linear equations

(0.36 a + b) / (0.36 + c) = 0.823
(0.29 a + b) / (0.29 + c) = 0.829
(0.15 a + b) / (0.15 + c) = 0.837

Solving these equations gives a = 0.865, b = -0.657, c = -0.78. Here x_0 = 0.36, and the vertex is x_v = 0.64696. The function is depicted in Figure 5.2. We note that f(x) shows a high sensitivity to x when x is close to 0.64696: an assessment x_0 in that region easily causes a large effect on f(x), since a little inaccuracy in its value can yield a large change in the value of f(x).
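The solve just performed can be reproduced mechanically: rearranging (x_i·a + b)/(x_i + c) = y_i into the linear system x_i·a + b − y_i·c = x_i·y_i, solving for (a, b, c), and locating the points where the sensitivity value |f'(x)| = |(ac − b)/(x + c)²| equals 1. A stdlib-only sketch (illustrative, not the thesis's tooling); the lower vertex comes out at about 0.647, matching the reported 0.64696 up to the rounding of the input pairs.

```python
import math

def fit_hyperbolic(pairs):
    """Solve (a*x + b)/(x + c) = y for (a, b, c) from three (x, y) pairs.

    Rearranging gives the linear system  x*a + b - y*c = x*y.
    """
    A = [[x, 1.0, -y] for x, y in pairs]
    rhs = [x * y for x, y in pairs]

    def det3(m):  # determinant of a 3x3 matrix
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(A)
    sol = []
    for col in range(3):  # Cramer's rule, one unknown per column
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][col] = rhs[r]
        sol.append(det3(Ai) / d)
    return tuple(sol)  # (a, b, c)

def vertex_roots(a, b, c):
    """Points where |f'(x)| = |(ac - b)/(x + c)^2| equals 1."""
    s = math.sqrt(abs(a * c - b))
    return (-c - s, -c + s)

# Pairs computed in GeNIe for the first scenario (Section 5.2.2).
pairs = [(0.36, 0.823), (0.29, 0.829), (0.15, 0.837)]
a, b, c = fit_hyperbolic(pairs)
lo, hi = vertex_roots(a, b, c)
print(round(a, 3), round(b, 3), round(c, 3))  # 0.865 -0.657 -0.78
print(round(lo, 3))  # 0.647 (the thesis reports 0.64696, from rounded constants)
```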
Figure 5.2: P(fail P124 | yes Hack) vs P(fail P232 | fail P124)

For the second scenario we only need two pairs of values:

(x, f(x)) = (0.33, 0.823), (0.25, 0.63)

Two linear equations are built:

0.33 a + b = 0.823
0.25 a + b = 0.63

Solving them gives a = 2.423, b = 0.0234; thus we get a linear function (Figure 5.3).

Figure 5.3: P(fail P232 | yes Hack) vs P(fail P232 | fail P124)

We repeat the procedure of the first scenario for the other three processes with regard to P124; the results are shown in Table 5.6 and Figure 5.4. Note that the sensitivity threshold for P121 lies on the right branch (for x < x_v, |f'_{P(x_r | o)}(x)| > 1). The parameter x_0 = P(fail P124 | yes Hack) is therefore bounded from the left and from the right simultaneously. If we only consider the range in which P(fail P124 | yes Hack) does not have a large sensitivity value with regard to the other four processes, x_0 should vary within (0.17654, 0.59217), or an even smaller interval. This interval is constituted by the overlap of four open intervals.
P1 (parameter node, observed): P124
P2 (node of interest) | P121 | P236 | P234
a | 0.99876 | 0.7792 | 0.559
b | -0.17449 | -0.66271 | -0.40074
c | -0.17471 | -0.906 | -0.78
vertex | 0.17654 | 0.69805 | 0.59217

Table 5.6: Sensitivity analysis for the first scenario, in which the varied parameter belongs to the observed node

Figure 5.4: (a) P(fail P124 | yes Hack) vs P(fail P121 | fail P124); (b) P(fail P124 | yes Hack) vs P(fail P236 | fail P124); (c) P(fail P124 | yes Hack) vs P(fail P234 | fail P124)

Overall, one-way sensitivity analysis can track the effects of changing parameter values, of the discretization of states and of arc addition or removal, as well as the robustness of the most likely output values. It is sufficient for answering most simple sensitivity questions if we do not seek complex interactions.
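The quoted safe range for x_0 can be reproduced as the intersection of the four open intervals implied by the vertices: for P121 the admissible region lies to the right of its vertex (right branch), while for P232, P236 and P234 it lies to the left of theirs. A minimal sketch under that reading of the branches (an assumption on my part, not stated explicitly in the thesis):

```python
# Per-process admissible intervals for x0 = P(fail P124 | yes Hack),
# built from the vertices in Table 5.6 and the P232 vertex 0.64696.
intervals = [
    (0.17654, 1.0),  # P121: right branch, x0 above the vertex
    (0.0, 0.64696),  # P232: vertex from the first-scenario fit
    (0.0, 0.69805),  # P236
    (0.0, 0.59217),  # P234
]

# Intersection of open intervals: largest left end, smallest right end.
lo = max(left for left, _ in intervals)
hi = min(right for _, right in intervals)
print((lo, hi))  # (0.17654, 0.59217)
```

Under this reading the intersection reproduces the interval (0.17654, 0.59217) stated in the text.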
Chapter 6

BCRA: Value at Risk

Value at Risk (VaR) is a widely used tool for risk assessment. It is generally described as the unexpected loss at a certain risk tolerance level, though it is sometimes used to describe the total risk exposure at that level. In statistical terms, VaR is obtained from an aggregated loss distribution, which is the combination of a frequency distribution and a loss distribution. In BCM, the probability of risk is usually very small, so the frequency distribution is defined by the event occurrence distribution. In practice, VaR helps to estimate how much capital should be reserved for potential risk every year.

6.1 Definition

VaR is not usually used in BCM, but it is a decent indicator of loss if we want to estimate the capital needed to cover unexpected losses at an appropriate level in a bank. Whereas expected losses can be described as the average losses that occur in one year, unexpected losses are deviations from the average that may put the bank's stability at risk. There exist loss events with low frequency but high impact; on the other hand, there are also high-frequency, low-severity events. In BCM we are more interested in the first situation, in which the aggregated loss distribution is likely to have heavy tails, since extreme losses can possibly occur.

We introduced distribution fitting for event severity using the event dataset in Chapter 4, where a specific threshold was determined to identify destructive events that may cause considerable damage and significant losses. Similarly, loss data can be fitted to an appropriate probability distribution. If no destructive event manifests itself, the loss amount is zero; if disaster damage is caused, losses are incurred and their amounts follow a certain distribution. The combination produces an aggregated loss distribution, from which we obtain the VaR as the percentile determined by our confidence level [52].

Definition 6.1.1.
(VaR) Given a confidence level θ ∈ (0, 1), the VaR of the loss at confidence level θ is given by the smallest number l such that the probability that the loss X exceeds l is at most 1 - θ:

VaR_θ(X) = inf{l ∈ R : P(X > l) ≤ 1 - θ} = inf{l ∈ R : F_X(l) ≥ θ}

We now see how to obtain VaR in practice. Suppose we have an event probability distribution F with density f(n) (a frequency in the discrete version), and let G be the loss distribution with
g(x | n) as the loss density given n. The aggregate loss density is given by

h(x) = Integral g(x | n) f(n) dn ( = Sum_n g(x | n) f(n) in the discrete version ) (6.1)

This expression is called a mixture, which is difficult to evaluate directly. However, statistical methods such as Monte Carlo simulation can be used to estimate VaR. By definition,

1 - θ = Integral_l^∞ h(x) dx = 1 - H(l) (6.2)

where H is the distribution function of the aggregate loss. VaR is obtained by taking the percentile of the aggregate loss distribution at the desired confidence level:

l = VaR_θ(X) = H^{-1}(θ), i.e. the upper (1 - θ) quantile of the aggregate loss (6.3)

For whichever fitted probability distributions we choose, random samples are generated from F and G. If a number generated from F exceeds a certain threshold, a loss is incurred and a loss amount is generated from G; otherwise the loss amount is set to zero. This yields an aggregated loss sample with many zero losses and a few non-zero loss amounts, and hence an aggregated loss distribution whose θ-th percentile is the VaR of interest (Figure 6.1).

Figure 6.1: Aggregated Loss Distribution

6.2 Application

We use the loss dataset from SAS Institute Inc. containing operational risk loss events that have become publicly available through various media (newspapers, internet) over a certain time span. The loss amounts are recorded in dollars, so we have to convert them into euros after the simulations. However, some loss data were recorded before the euro had been issued. Furthermore, the exchange rate USD:EUR fluctuates over time, which may cause some inaccuracy in the VaR. These biases would be reduced if an exchange rate model were included, which is out of scope for this thesis. To keep the currency treatment as simple as possible, we take the exchange rate as USD:EUR = 1.36:1 in our model, an average exchange rate in June 2014. We focus on the dataset of earthquake losses in SAS, and fit the data to a Pareto distribution using MLE.
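The simulation procedure described above can be sketched in a few lines: draw event magnitudes and loss amounts from the two fitted Pareto distributions reported in this section (α = 6.511, x_m = 2.4 for Europe earthquake magnitudes; α = 1.4374, x_m = 2 for the losses), zero out the losses of events at or below the magnitude-5 threshold, and read VaR off an empirical percentile. This is an illustrative stdlib-only reimplementation with a smaller sample size, not the thesis's Matlab code; exact numbers differ from run to run.

```python
import random

def pareto_sample(alpha, xm, n, rng):
    # Inverse-CDF sampling: if U ~ Uniform(0, 1], then xm * U**(-1/alpha)
    # is Pareto(alpha, xm).
    return [xm * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

rng = random.Random(2014)
N = 200_000

magnitudes = pareto_sample(6.511, 2.4, N, rng)   # event distribution F
severities = pareto_sample(1.4374, 2.0, N, rng)  # loss distribution G

# Aggregated loss sample: a loss is incurred only when the magnitude exceeds 5.
agg = sorted(s if m > 5 else 0.0 for m, s in zip(magnitudes, severities))

def var_at(theta):
    # theta-th percentile of the aggregated loss sample
    return agg[int(theta * len(agg))]

var_995, var_999 = var_at(0.995), var_at(0.999)
print(var_995, var_999)  # heavy tails: the 99.9% VaR far exceeds the 99.5% VaR
```

Most simulated losses are zero (only roughly (2.4/5)^6.511 ≈ 0.8% of magnitudes exceed 5), so the high percentiles sit in the short non-zero tail, exactly as in Figure 6.4.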
The optimized fitting parameters are α = 1.4374, x_m = 2 (Figure 6.2); in addition, the K-S statistic is 0.1842. In Section 4.2.3 we obtained the fitted distribution for Europe earthquakes,
which is also a Pareto distribution, with α = 6.511, x_m = 2.4. Hence it is easy to generate two sets of 1,000,000 simulated data points from the respective fitted distributions (Figure 6.3).

Figure 6.2: Loss Distribution in Log-log Scale

Figure 6.3: Simulated Loss Data

An earthquake causes losses if and only if its magnitude exceeds 5, so we generate a sample of aggregated losses using this threshold (Figure 6.4; 1,000,000 numbers). In order to ensure the accuracy of the percentiles, we replicate this simulation 1000 times, which yields the results in Tables 6.1 and 6.2. Outliers, i.e. extremely large values, can cause us to overestimate VaR at a high confidence level if there are few data points; to obtain a better result, we may want to exclude these values from the analysis. The average loss amount (expected loss) is €19,685,480, taking the median of the means of losses over all replications [8]. We present the results of interest: the 99.5%, 99.9% and 99.95% VaR, and the probabilities of an annual loss exceeding €10 million, €30 million, €100 million and €300 million respectively. The unexpected loss is the difference between VaR and expected loss. Therefore, if the bank wants to cover the average losses incurred in its usual course of business due to damage
Figure 6.4: Aggregated Loss Distribution: (a) Small Loss Amounts (b) Large Loss Amounts

caused by earthquakes in Europe, it should keep €19,685,480 as (one-year) provisions. If the bank also wants to protect the business against potentially severe losses, it should hold an additional €15,150,441 in capital reserves (to be adequately covered at the 99.5% level). Furthermore, if we consider the VaR for all other risks within BCM, the aggregated VaR is the estimated amount of capital that the bank should reserve for potential losses in a year. This offers a valuable estimate for BCM.

Confidence Level | VaR | Expected loss | Unexpected loss | Provision | Capital requirement
99.50% | 34,835,922 | 19,685,480 | 15,150,441 | 19,685,480 | 15,150,441
99.90% | 155,221,608 | 19,685,480 | 135,536,128 | 19,685,480 | 135,536,128
99.95% | 250,197,453 | 19,685,480 | 230,511,973 | 19,685,480 | 230,511,973

Table 6.1: VaR at Different Confidence Levels (amounts in €)

Loss amount (€m) | Probability
10 | 1.0819%
30 | 0.5626%
100 | 0.1746%
300 | 0.0376%

Table 6.2: Probability that the Loss Exceeds Certain Thresholds

As we have seen, loss simulation results are transparent and easy to interpret, and percentiles at different confidence levels are easy to visualize [53]. These advantages make VaR calculations feasible in BCM.
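As a footnote to the percentile extraction used throughout this chapter: on a finite sample, the infimum in Definition 6.1.1 reduces to an order statistic — sort the losses and take the smallest value whose empirical CDF reaches θ. A minimal sketch (not thesis code):

```python
import math

def empirical_var(losses, theta):
    """Smallest l in the sample with P(X > l) <= 1 - theta,
    i.e. the ceil(theta * n)-th order statistic."""
    xs = sorted(losses)
    k = math.ceil(theta * len(xs))  # 1-based rank
    return xs[k - 1]

sample = list(range(1, 101))        # losses 1, 2, ..., 100
print(empirical_var(sample, 0.95))  # 95: P(X > 95) = 0.05 <= 0.05
```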
Chapter 7

Conclusion and Suggestions

7.1 Conclusion

The purpose of this project was to introduce a methodology for probabilistic analysis within Business Continuity Management. It aimed at investigating the business structure of a bank and estimating the conditional probabilities of the associated risks, as well as the financial impacts that may influence the regular operation of business processes and their sub-processes. The methodology is composed of querying a business graph database, calculating network indices of the business structure, building a Bayesian network and estimating Value at Risk. Popular statistical approaches such as distribution fitting, the bootstrap method and Monte Carlo simulation were used to support the analysis. Most of the work was implemented in Excel, Matlab, Neo4j, Gephi and GeNIe.

First of all, a business graph database was built in Neo4j from the existing business structure of ABN AMRO, in which we could find the business components of interest and query the relation paths corresponding to those components. The structure could be imported into Gephi so that network metrics such as centrality could be computed. Three major centralities were used to give an overview of criticality and vulnerability within each node group: criticality is mainly associated with in-degree, betweenness and eigenvector centrality, while vulnerability is determined by out-degree centrality.

The second model was a Bayesian network built in GeNIe. The model consists of probabilistic nodes whose conditional probabilities depend only on their parent nodes. By these independencies, any joint probability that represents a network or sub-network can be written as a product of conditional probabilities representing smaller network structures. To reduce the computational time and the effort of filling in conditional probability tables, Noisy-OR or Noisy-MAX CPDs can replace common tabular CPDs to lower the number of parameters.
We used two common approaches to estimate the risk probabilities in a Bayesian network. The first one is distribution fitting, where the collected risk data are fitted to candidate distributions; we then selected the one that matches the real problem and has a small K-S statistic, and finally obtained the target probability from the thresholds. One of the most appropriate models is the Pareto distribution, whose cumulative distribution function is linear on a log-log scale, so that it does not underestimate the tail probability. Resampling methods, such as the bootstrap, were applied to build confidence intervals for the deviations of the parameters. The other approach is Bayesian estimation using the EM algorithm. Taking the Dirichlet distribution as a prior, the
posterior distribution is also Dirichlet, so that the probability of a certain state can be updated and reflected in its parameters. The strength of the prior represents the confidence in the prior, which determines how sensitive the probability is during data learning.

If some observations are provided and we want to know conditional probabilities of interest given that evidence, Bayesian inference provides efficient algorithms to answer the questions. Exact inference algorithms are accurate but limited by calculation speed. In our tests, the EPIS algorithm was the most powerful approximate inference algorithm for complex networks, with the minimum computational time and the smallest variance in the probability estimates.

To validate the accuracy of the probabilities in a Bayesian network, sensitivity analysis helps to investigate the interactions among the conditional probabilities of interest. Their relationship is usually hyperbolic, and simplifies to linear if the nodes of interest are not in the set of observed nodes. A large sensitivity value implies a greater possibility of inaccuracy in the output probability.

One of the most useful applications of the risk probability estimation is Value at Risk: the maximum loss at a given confidence level over a time span. Using Monte Carlo simulation, the aggregated loss distribution can be generated from random samples of the event and loss distributions, and the VaR is obtained by taking the appropriate percentile of the aggregated loss distribution. The capital required for potential losses is the capital required to cover the expected losses plus the unexpected losses under the VaR of interest.

To conclude, this thesis documents BCM from a quantitative perspective. The associated models can be put into practice, but further improvement is still necessary (see Section 7.2).
7.2 Suggestions

As stated in the conclusion, probabilistic analysis in BCM gives us a fundamental methodology for quantitative risk assessment. Nevertheless, there is scope for improvement: some details could not be researched further because of time constraints. We therefore summarize below the major items that deserve investigation in future research.

7.2.1 Bayesian Network

In Chapter 3 we built a Bayesian network based on the business structure of a bank. However, the model was simplified, because detailed information on buildings and risks was lacking. An example is the dependencies between risks and buildings: the fire probability should vary among buildings with different structures and floor areas, but detailed data on the buildings are not available, so there is only one Fire node connected to all the buildings of the bank. Likewise, buildings in different cities in the Netherlands should have different probabilities of earthquake, flood or storm. For example, Groningen, a northern city in the Netherlands, usually has more earthquakes than cities in the west. Under these circumstances, buildings in cities that are hardly affected by earthquakes (for instance Amsterdam, with probability below our concern) are not considered to be influenced by the
occurrence of earthquakes, and we may then omit the direct links between earthquake and those buildings in the Bayesian network. Furthermore, the conditional probabilities require further improvement and validation: the parameters should be adjusted to match the real world once the dependencies are fully known or estimated by experts. For instance, the parameters of Fail in the Noisy-OR CPDs of the processes are all 0.8, which should be varied in future work. In addition, the risks and mitigation methods included in the model are not specific to individual buildings; they are set as global nodes with equivalent impact on each building node.

7.2.2 Probability Analysis

We chose the Pareto distribution as a common example of distribution fitting for risks in BCM. Nonetheless, the distribution we used is not unique: other distributions with good fitting quality and heavy tails are also suitable for probability estimation. Generally, the trade-off depends on the data quality and on expert knowledge of the risks of interest. Moreover, risk data are usually not easy to obtain, which is why we focused our analysis on a few types of risk and set the probabilities of the other risks by educated guessing. In addition, this analysis is based on a first level of data: it is unrealistic to evaluate all the water and wind stations in a country. If further data are provided by experts in the relevant domains, the model will become more accurate.

7.2.3 VaR

Figure 7.1: VaR, CVaR Deviation

Similarly, the loss data we use can only give a rough estimate of the loss probability. A better way would be to combine the SAS data with internal data; however, the records of internal failures are few and incomplete. Although Monte Carlo simulation solves this problem to some extent, VaR does not account for some scenarios, especially extremely large losses. An outlier, i.e. an extremely large loss, can easily cause an overestimate of VaR.
More advanced approaches, such as CVaR (Conditional Value at Risk), account for losses larger than VaR. CVaR is a complementary
method to VaR: it is the expected loss given that the loss equals or exceeds VaR,

CVaR_θ(X) = E[X | X ≥ VaR_θ(X)] (7.1)

From CVaR it is thus possible to evaluate the extent of the potential losses that are larger than VaR.
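On a loss sample, the empirical counterpart of (7.1) is simply the average of the losses at or above the empirical VaR (itself the order statistic of Definition 6.1.1). A minimal sketch, not thesis code:

```python
import math

def empirical_cvar(losses, theta):
    """Mean of the sample losses that equal or exceed the empirical VaR_theta."""
    xs = sorted(losses)
    var = xs[math.ceil(theta * len(xs)) - 1]  # empirical VaR as an order statistic
    tail = [x for x in xs if x >= var]
    return sum(tail) / len(tail)

sample = list(range(1, 101))          # losses 1, ..., 100
print(empirical_cvar(sample, 0.95))   # 97.5 = mean of 95..100
```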
Appendix A

Maximum Likelihood Estimation of α

Consider the Pareto density

p(x) = (α / x_m) (x / x_m)^-(α+1) (A1)

Given a set of data {x_i : i = 1, ..., n}, we have the likelihood function

P(x | α) = Prod_{i=1}^n p(x_i) = Prod_{i=1}^n (α / x_m) (x_i / x_m)^-(α+1) (A2)

In the Bayesian context we want to know the probability of a particular value of α given the observed x_i, which is given by Bayes' theorem:

p(α | x) = p(x | α) p(α) / p(x) (A3)

The prior probability of the data, p(x), is fixed: it is 1 for the set of observations we made and zero for every other set. And we usually assume, in the absence of any information to the contrary, that the prior probability of the exponent, p(α), is uniform, i.e. a constant. Thus p(α | x) ∝ p(x | α). For convenience we typically work with the logarithm of p(α | x), which, up to an additive constant, is equal to the log-likelihood

L = ln p(x | α) = Sum_{i=1}^n [ln α - ln x_m - (α + 1) ln(x_i / x_m)] = n ln α - n ln x_m - (α + 1) Sum_{i=1}^n ln(x_i / x_m) (A4)

Now we calculate the most likely value of α by maximizing the likelihood with respect to α, which is the same as maximizing the log-likelihood, since the logarithm is a monotonically increasing function. Setting ∂L/∂α = 0, we find

n/α - Sum_{i=1}^n ln(x_i / x_m) = 0 (A5)

or

α = n [Sum_i ln(x_i / x_m)]^-1 (A6)
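The closed form (A6) is easy to check numerically: simulate a Pareto sample with a known exponent and compare the estimate. An illustrative sketch, not part of the thesis code:

```python
import math
import random

def pareto_mle_alpha(xs, xm):
    """Maximum likelihood estimate of the Pareto exponent, eq. (A6)."""
    return len(xs) / sum(math.log(x / xm) for x in xs)

rng = random.Random(42)
alpha_true, xm = 3.0, 2.0
# Inverse-CDF sampling: xm * U**(-1/alpha) is Pareto(alpha, xm) for U ~ Uniform(0, 1].
xs = [xm * (1.0 - rng.random()) ** (-1.0 / alpha_true) for _ in range(50_000)]

alpha_hat = pareto_mle_alpha(xs, xm)
print(round(alpha_hat, 2))  # close to 3.0
```

The standard error of the estimate is roughly α/√n, so with 50,000 samples the estimate lands within a few hundredths of the true exponent.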
Appendix B

Sample Results of Other Risks

In Chapter 4 we showed that the Pareto distribution fits better than other probability distributions when a certain lower limit exists. However, this is not always the case for other risk data. In this appendix we show some results for other datasets, including water levels and wind speeds. In most scenarios the Pareto distribution gives a decent result, but sometimes other distributions are a better choice (see the Generalized Pareto fits in Table B.1). Generally, in practice the Pareto distribution remains one of the first choices for data fitting when calculating extreme value probabilities.
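The K-S statistics reported in Table B.1 measure the largest gap between the empirical CDF and the fitted CDF. A minimal version of that computation for a Pareto fit (illustrative only; the thesis used its own fitting tools, and real data would replace the simulated sample):

```python
import math
import random

def pareto_cdf(x, alpha, xm):
    return 1.0 - (xm / x) ** alpha if x >= xm else 0.0

def ks_statistic(xs, cdf):
    """D = sup_x |F_emp(x) - F(x)|, evaluated over the sorted sample."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare the fitted CDF against the empirical CDF just before
        # and just after each jump point.
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

rng = random.Random(7)
alpha, xm = 3.0, 1.0
xs = [xm * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(2000)]

# Fit alpha by MLE (eq. A6) and evaluate the goodness of fit.
alpha_hat = len(xs) / sum(math.log(x / xm) for x in xs)
d = ks_statistic(xs, lambda x: pareto_cdf(x, alpha_hat, xm))
print(round(d, 3))  # small for a good fit (roughly below 1.36/sqrt(n))
```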
B.1 Wind Speed Data

Data source: KNMI. Date range: 01/1980-05/2014. Wind speed is in m/s and thresholds are based on the Beaufort scale.

Figure B.1: Netherlands Wind Speed Data and Fitted Density

Figure B.2: Pareto Fit with Best Fit Quality but Too Large x_m = 17.3 (close to 20)

Figure B.3: Pareto Fit with Good Fit Quality and Appropriate x_m = 10.3 (far from 20)
B.2 Water Level Data

Data source: KNMI. Date range: 01/1980-12/2009. Water level is in cm or cft/s and thresholds are based on individual reference documents.

Station | Pareto | Generalized Pareto
arnhm | 0.285 | 0.1521
clubbg | 0.2956 | 0.1393
denhdr | 0.075 | 0.0368
devtr | 0.1073 | 0.0759
eemshvn | 0.1473 | 0.0381
ijmuiden | 0.1416 | 0.0607
katerveer | 0.0415 | 0.0861
roermbyn | 0.078 | 0.2443
rotterdam | 0.1804 | 0.032
westkapelle | 0.1416 | 0.0607

Table B.1: Pareto and Generalized Pareto K-S statistics at 10 water stations

Figure B.4: CDF of Arnhem and Culemborg brug

Figure B.5: CDF of Den Helder and Deventer
Figure B.6: CDF of Eemshaven and IJmuiden

Figure B.7: CDF of Katerveer and Roermond boven

Figure B.8: CDF of Rotterdam and Westkapelle
Bibliography

[1] E. Smit. Business Continuity Management Introductory Course. 2014.
[2] Wikipedia. Business Continuity Planning. 2013.
[3] K. Miller. Business Impact Analysis (BIA), informal paper. 2005.
[4] P. Sikdar. Alternate approaches to Business Impact Analysis. Information Security Journal: A Global Perspective, 20(3):128-134, May 2011.
[5] H. Chen and A. Pollino. Good practice in Bayesian network modelling. Environmental Modelling & Software, 37:134-145, April 2012.
[6] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009.
[7] M.H. Coupé and L.C. van der Gaag. Properties of sensitivity analysis of Bayesian networks. Annals of Mathematics and Artificial Intelligence, 36:323-356, December 2002.
[8] E. Navarrete. Practical calculation of expected and unexpected losses in operational risk by simulation methods. Banca & Finanzas: Documentos de Trabajo, 1(1):1-12, October 2006.
[9] I. Robinson, J. Webber and E. Eifrem. Graph Databases. Wiley, 2013.
[10] Neo4j Team. Neo4j Manual 2.0.1, 2014.
[11] Neo Technology. Neo4j Cypher Refcard 2.0, 2014.
[12] M.E.J. Newman. Networks: An Introduction. Oxford University Press, 2010.
[13] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010.
[14] P. Bonacich. Power and centrality: a family of measures. American Journal of Sociology, 92(5):1170-1182, March 1987.
[15] L.C. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40(1):35-41, March 1977.
[16] P. Bonacich. Simultaneous group and individual centralities. Social Networks, 13(2):155-168, June 1991.
[17] Wikipedia. Eigenvalues and eigenvectors. 2014.
[18] Gephi Team. Gephi Tutorial, 2010.
[19] M. Jacomy, T. Venturini, S. Heymann and M. Bastian. ForceAtlas2, a graph layout algorithm for handy network visualization. 2011.
[20] Wikipedia. Bayes' Theorem. 2014.
[21] M. Takikawa and B. D'Ambrosio. Multiplicative factorization of Noisy-MAX. Morgan Kaufmann Publishers Inc., pages 622-630, 1999.
[22] M. Henrion. Practical issues in constructing a Bayes' belief network. CoRR, March 1987.
[23] F.J. Díez. Parameter adjustment in Bayes networks. Morgan Kaufmann, 1993.
[24] A. Zagorecki and M. Druzdzel. Knowledge engineering for Bayesian networks: How common are Noisy-MAX distributions in practice? IOS, pages 482-486, August 2006.
[25] Decision Systems Laboratory, University of Pittsburgh. GeNIe documentation. 2010.
[26] Koninklijk Nederlands Meteorologisch Instituut, website. 2014.
[27] U.S. Geological Survey, website. 2014.
[28] Met Office. National Meteorological Library and Archive Fact Sheet 6 - The Beaufort Scale. 2010.
[29] E. Splitt, M. Lazarus, S. Collins, N. Botambekov and P. Roeder. Probability distributions and threshold selection for Monte Carlo type tropical cyclone wind speed forecasts. American Meteorological Society, 2002.
[30] B.D. Malamud and D.L. Turcotte. The applicability of power-law frequency statistics to floods. Journal of Hydrology, 322:168-180, 2006.
[31] Statistics Netherlands. 2014.
[32] A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[33] D. Revuz and M. Yor. Continuous Martingales and Brownian Motion. Springer, 1999.
[34] M. Kendall and A. Stuart. The Advanced Theory of Statistics. Wiley, 1961.
[35] M.E.J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46:323-351, 2005.
[36] B.C. Arnold. Pareto distributions. Encyclopedia of Statistical Sciences, pages 1-8, 1983.
[37] B. Gutenberg and C.F. Richter. Frequency of earthquakes in California. 1944.
[38] B. van Es and H. Putter. The Bootstrap, Lecture Notes UvA. 2011.
[39] B. Efron. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171-185, 1987.
[40] S. Kotz, N. Balakrishnan and N.L. Johnson. Continuous Multivariate Distributions. Wiley, 2000.
[41] A.P. Dempster, N.M. Laird and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[42] R.J.A. Little and D.B. Rubin. Statistical Analysis with Missing Data (2nd Edition). Wiley, 2002.
[43] C. Yuan and M.J. Druzdzel. An importance sampling algorithm based on evidence pre-propagation. pages 624-631, 2003.
[44] J. Cheng and M.J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research, 13:155-188, 2000.
[45] M. Henrion. Propagating uncertainty in Bayesian networks by probabilistic logic sampling. pages 149-163, 1986.
[46] R. Fung and B. Del Favero. Backward simulation in Bayesian networks. pages 227-234, 1994.
[47] R. Fung and K.-C. Chang. Weighing and integrating evidence for stochastic simulation in Bayesian networks. pages 209-220, 1990.
[48] R.D. Shachter and M.A. Peot. Simulation approaches to general probabilistic inference on belief networks. pages 221-234, 1989.
[49] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Proceedings of the American Association of Artificial Intelligence National Conference on AI, pages 133-136, 1982.
[50] L.C. van der Gaag, S. Renooij and M.H. Coupé. Sensitivity analysis of probabilistic networks. pages 103-124, 2007.
[51] L.C. van der Gaag, R. Kuijper, M. van Geffen and J.L. Vermeulen. Towards uncertainty analysis. pages 445-476, 2012.
[52] Towers Perrin ORA. A new approach for managing operational risk, informal paper. 2009.
[53] A.J. McNeil, R. Frey and P. Embrechts. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press, 2005.