Risk Mitigation Strategies for Critical Infrastructures Based on Graph Centrality Analysis


George Stergiopoulos a, Panayiotis Kotzanikolaou b, Marianthi Theocharidou c, Dimitris Gritzalis a

a Information Security & Critical Infrastructure Protection Laboratory, Dept. of Informatics, Athens University of Economics & Business, 76 Patission Ave., GR-10434, Athens, Greece
b Dept. of Informatics, University of Piraeus, 85 Karaoli & Dimitriou, GR-18534, Piraeus, Greece
c European Commission, Joint Research Center (JRC), Institute for the Protection and the Security of the Citizen (IPSC), Security Technology Assessment Unit, via E. Fermi 2749, Ispra, I-21027, Italy

Abstract

Dependency Risk Graphs have been proposed as a tool to analyze the cascading failures of Critical Infrastructure (CI) dependency chains. However, dependency chain analysis is not by itself sufficient for developing an efficient risk mitigation strategy, i.e. a strategy that defines which CI nodes should have high priority for the application of mitigation controls, in order to achieve the optimal overall risk reduction. In this paper we extend our previous dependency risk analysis methodology towards efficient risk mitigation, by exploring the relation between dependency risk paths and graph centrality characteristics. We explore how graph centrality metrics can be applied to design and to evaluate the effectiveness of risk mitigation strategies. We examine alternative mitigation strategies and empirically evaluate them through simulations. Our experiments are based on random graphs that simulate common CI dependency characteristics, as identified by recent empirical studies. Based on our experimental findings, we propose an algorithm that can be used to prioritize CI nodes for the application of mitigation controls, in order to achieve efficient risk mitigation.
Corresponding author. Email addresses: geostergiop@aueb.gr (George Stergiopoulos), pkotzani@unipi.gr (Panayiotis Kotzanikolaou), marianthi.theocharidou@jrc.ec.europa.eu (Marianthi Theocharidou), dgrit@aueb.gr (Dimitris Gritzalis)

Preprint submitted to International Journal of Critical Infrastructure Protection, May 4, 2015

Keywords: Critical Infrastructure, Dependency Risk Graphs, Graph Centrality, Cascading failure, Information gain, Feature selection, Mitigation

1. Introduction

(Inter)dependencies between Critical Infrastructures (CIs) are a key factor for Critical Infrastructure Protection, since CI dependencies may allow a seemingly isolated failure in a single CI to cascade across multiple CIs. One way to analyze CI dependencies is by using Dependency Risk Graphs, i.e. graphs in which the nodes represent (modules of) CIs and the directed edges represent the potential risk that the destination node may suffer due to its dependency on the source node, in case a failure is realized at the source CI. In our previous work [1, 2, 3, 4, 5] we have proposed a risk-based methodology that can be used to assess the cumulative risk of dependency risk paths, i.e. of chains of CI nodes which are (inter)connected due to their dependencies. Our methodology uses as input the existing risk assessment results from each CI operator and, based on the 1st-order dependencies between the CI nodes, it assesses the potential risk values of all the n-order dependency risk chains. By computing and sorting all the dependency risk paths, the assessor may identify the most critical dependency chains, i.e. all potential dependency risk paths with a cumulative risk above a predefined threshold.

Although the identification of the most important dependency chains is an important step towards efficient risk mitigation, there are still open problems. A simple mitigation strategy would be to apply security controls at the root node of each critical dependency chain. However, this may not always be a cost-effective strategy, since there are cases of nodes whose effect is not properly measured. For example, consider the case where a small subset of nodes (not necessarily roots of the critical chains) exists that affects a large number of critical dependency paths.
Decreasing the probability of failure in these nodes, by selectively applying security controls to them, may have a greater overall benefit in risk reduction. Another example is when nodes of high importance exist outside the most critical dependency paths; for example, nodes that may concurrently affect many other nodes which belong to the set of the most critical paths, or nodes that affect the overall dependency risk of the entire structure/graph.

Motivation. Existing risk mitigation methodologies are usually empirical, and they focus on CIs which are possible initiators of cascading failures and usually belong to specific sectors, such as Energy and ICT. By studying actual large-scale failures, it is obvious that these two sectors often initiate serious cascading effects amongst interdependent CIs (e.g. the famous California blackout scenario or the recent New York blackout scenario of 2015). However, focusing only on certain sectors and on potential initiators may not be suitable in every case. The impact of each dependency should be taken into account together with the position of each CI within a network of interdependent CIs. The systematic identification of the most important nodes and the prioritization of nodes for the application of security controls can therefore be a complex task. The need for a high-level and efficient risk mitigation technique which takes into account multiple characteristics of the interdependent CIs is well known [6]. An optimal risk mitigation methodology would allow the detection of the smallest subset of CIs that has the highest effect on the overall risk reduction in a risk graph, even though the candidate nodes might not initiate a critical path or may not even belong to the most critical paths.

Contribution. Towards this direction, in this paper we explore the use of graph centrality metrics in Dependency Risk Graphs, in order to effectively prioritize CIs when applying risk mitigation controls. Our goal is to design, implement and evaluate the effectiveness of alternative risk mitigation strategies. Each examined mitigation strategy is implemented algorithmically and empirically evaluated through simulations.
The ultimate goal is to identify the minimum subset of CI nodes within a graph, whose risk treatment (application of security controls) would result in the maximum risk reduction in the entire graph of interdependent CIs. Our extensions are threefold: (a) we run various data mining experiments to identify correlations between centrality metrics and CI nodes that have high impact in a dependency risk graph; (b) we use the optimum centrality metrics, as identified in the previous step, in order to develop and test various risk mitigation strategies that maximize risk reduction; results show that each mitigation strategy is most effective in achieving specific goals; (c) we take into consideration nodes with a high number of inbound (sinkholes) and outbound connections and compare risk reduction results.

In order to validate the proposed mitigation strategies, we run experiments on hundreds of random graphs with randomly selected dependencies. These tests demonstrate the efficiency of the proposed risk mitigation strategies. All random graphs match criteria found in real-world graphs of interconnected infrastructures. The proposed mitigation strategies can be used as a proactive tool for the analysis of large-scale infrastructure interconnections, pinpointing underestimated infrastructures that are critical to the overall risk of a dependency graph.

2. Building Blocks

Our methodology is based on three main building blocks: (i) dependency risk graphs and the multi-risk dependency analysis methodology, in order to model cascading failures; (ii) graph centrality analysis over dependency risk graphs; and (iii) feature selection techniques, in order to evaluate the effect of various centrality metrics in risk mitigation. We briefly describe these building blocks below.

2.1. Multi-risk dependency analysis methodology

Dependency can be defined as the one-directional reliance of an asset, system, network, or collection thereof within or across sectors on an input, interaction, or other requirement from other sources in order to function properly [7]. Our work extends the dependency risk methodology of [1, 2], used for the analysis of multi-order cascading failures.
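As a minimal illustration of this building block, a dependency risk graph can be represented as directed edges annotated with likelihood and impact values (a Python sketch; the node names and numeric values are hypothetical, not taken from the paper):

```python
# Toy dependency risk graph: each directed edge (source, destination) carries
# a likelihood L of the disruption cascading and an impact I on the destination.
# The 1st-order dependency risk received by the destination is R = L * I.
edges = {
    ("CI_A", "CI_B"): {"L": 0.8, "I": 6.0},
    ("CI_B", "CI_C"): {"L": 0.5, "I": 4.0},
    ("CI_A", "CI_C"): {"L": 0.2, "I": 9.0},
}

def first_order_risk(edge):
    """Dependency risk R_ij = L_ij * I_ij received by the destination node."""
    return edge["L"] * edge["I"]

for (src, dst) in sorted(edges):
    print(src, "->", dst, "R =", first_order_risk(edges[(src, dst)]))
```

The 1st-order risk of each edge is simply the product of its two annotations; the n-order extension is described next.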

First-order dependency risk. In [1, 2], the 1st-order dependencies between CIs are initially identified and are then extended to model the n-order relations. Each dependency from a node CI_i to a node CI_j is assigned an impact value, denoted as I_{i,j}, and a likelihood L_{i,j} of a disruption being realized¹. The product of these two values defines the dependency risk R_{i,j} caused to the infrastructure CI_j due to its dependency on CI_i. Dependencies are visualized through graphs G = (N, E), where N is the set of nodes (infrastructures or components) and E is the set of edges (dependencies). The graph is directed and the destination CI receives a risk from the source CI due to its dependency.

¹ The impact and likelihood values are assumed to be extracted from the organization-level risk assessments performed by each CI operator.

Extending to n-order dependency risk. Let CI = (CI_1, ..., CI_m) be the set of all the examined infrastructures. The algorithm examines each CI as a potential root of a cascading effect. Let CI_{Y_0} denote a critical infrastructure examined as the root of a dependency chain and CI_{Y_0} → CI_{Y_1} → ... → CI_{Y_n} denote a chain of length n. Then the algorithm computes, for each examined chain, the cumulative Dependency Risk of the n-order dependency chain as:

DR_{Y_0,...,Y_n} = \sum_{i=1}^{n} R_{Y_0,...,Y_i} = \sum_{i=1}^{n} \left( \prod_{j=1}^{i} L_{Y_{j-1},Y_j} \right) I_{Y_{i-1},Y_i}    (1)

Informally, equation (1) computes the dependency risk contributed by each affected node in the chain, due to a failure realized in the source node. The computation of the risk is based on a risk matrix that combines the likelihood and the incoming impact values of each vertex in the chain. Interested readers are referred to [2] for a detailed analysis.
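Equation (1) translates directly into code. The sketch below (our own helper names; edge values are illustrative) accumulates, for each hop i of the chain, the product of the likelihoods up to that hop multiplied by the incoming impact:

```python
def cumulative_dependency_risk(chain, L, I):
    """Cumulative dependency risk DR_{Y0,...,Yn} of a chain [Y0, ..., Yn]
    per equation (1): sum over i of (prod_{j<=i} L_{Yj-1,Yj}) * I_{Yi-1,Yi}."""
    total = 0.0
    likelihood = 1.0                # running product of edge likelihoods
    for i in range(1, len(chain)):
        edge = (chain[i - 1], chain[i])
        likelihood *= L[edge]       # Π_{j=1..i} L_{Yj-1,Yj}
        total += likelihood * I[edge]
    return total

# A 2nd-order chain A -> B -> C with illustrative values:
L = {("A", "B"): 0.8, ("B", "C"): 0.5}
I = {("A", "B"): 6.0, ("B", "C"): 4.0}
# DR = 0.8*6.0 + (0.8*0.5)*4.0 = 4.8 + 1.6 = 6.4
print(cumulative_dependency_risk(["A", "B", "C"], L, I))
```

Note how the running likelihood product captures the observation used later in the paper: nodes early in a chain weigh on every subsequent partial risk term.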

2.2. Graph centrality metrics

Graph centrality analysis attempts to quantify the position of a node in relation to the other nodes in a graph. In other words, graph centrality metrics are used to estimate the relative importance of a node within a graph. Various centrality metrics are used, each of them measuring a different characteristic:

Degree centrality measures the number of edges attached to each node. Given a node u, degree centrality is defined as C_d(u) = deg(u), where deg(u) is the total number of its outgoing and ingoing edges.

Closeness centrality quantifies the intuitive central or peripheral placement of a node in a two-dimensional region, based on geodesic distances. It is defined as C_c(u) = \sum_{v \in V(G)} \delta(u, v), where \delta(u, v) is the average shortest path between the examined node u and any other node v in the graph.

Betweenness centrality measures the number of paths a node participates in, and is defined as C_b(u) = \sum_{i \neq u \neq j \in V} \delta_{ij}(u), where \delta_{ij}(u) = \sigma_{ij}(u) / \sigma_{ij}. Here, \sigma_{ij}(u) denotes the number of geodesic paths from i to j in which node u is present and \sigma_{ij} the total number of geodesic paths from i to j.

Bonacich (Eigenvector) centrality [8] attempts to measure the influence of a node in a network as c_i(\alpha, \beta) = \sum_j (\alpha + \beta c_j) R_{i,j}, or in matrix form c(\alpha, \beta) = \alpha (I - \beta R)^{-1} R \mathbf{1}, where \alpha is a scaling factor, \beta reflects the extent to which centrality is weighted, R is the node adjacency matrix, I is the identity matrix and \mathbf{1} is a vector of ones. An adjacency matrix is an N \times N matrix with values of 1 if there is an edge between two nodes and 0 otherwise.

Eccentricity centrality is similar to closeness centrality. Essentially, it is based on the greatest distance over all shortest paths between u and any other vertex (the geodesic distance).

2.3. Feature selection

Feature selection (also called variable selection or attribute selection) is used in data mining for the automatic selection of attributes in a data set that are most relevant to a predictive modeling problem. Feature selection is critical for the effective analysis of characteristics in data sets, since data sets frequently contain redundant information that is not necessary to build a predictive model [9]. It increases both effectiveness and efficiency, since it removes non-informative terms according to corpus statistics [10].

In our case, we explore centrality metrics as potential features in a dependency risk graph. Our predictive model aims to detect which features (i.e. centrality metrics) are able to characterize nodes that mostly affect the cumulative risk of dependency paths. This model will identify which centrality metrics are the most useful (i.e. provide the most information) about whether a CI node greatly affects the overall risk of a graph or not. The various centrality types are ranked lower if they are common in both the so-called positive set (nodes that have a high effect on the risk) and the negative set (nodes with low impact on risk paths). The features are ranked higher if they are effective discriminators for a class.

Our feature selection tests use two well-known selection techniques, namely Information Gain and Gain Ratio. Information Gain [11] is a statistical measure that has been successfully applied for feature selection in information retrieval. The authors in [12, 13] used this algorithm for feature selection before training a binary classifier. This technique is used to classify high centrality metrics in nodes, in an effort to distinguish which centrality metrics tend to appear often in nodes that greatly affect the cumulative risk of graphs. Gain Ratio is a variation of the Information Gain algorithm. The difference is that the Information Gain measure prefers to select attributes having a large number of values [14]. Gain Ratio overcomes this bias by taking into consideration the statistical appearance of a feature in both the negative and positive sets (i.e. if a specific centrality metric is detected in dangerous CI nodes scarcely but is never found in safe nodes, it will have a rather low Information Gain rank but a very high Gain Ratio one). We use this technique to pinpoint (combinations of) centrality metrics that might appear scarcely in graphs, yet significantly affect the cumulative risk in paths when present.
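For concreteness, both measures can be computed from scratch for a single boolean feature over a labeled node set (a minimal sketch of the standard entropy-based definitions, not of Weka's implementation; all names are ours):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a set with `pos` positive and `neg` negative items."""
    h, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(labels, feature):
    """IG of a boolean feature over boolean labels (True = 'dangerous' node):
    label entropy minus the entropy remaining after splitting on the feature."""
    pos = sum(labels)
    base = entropy(pos, len(labels) - pos)
    cond = 0.0
    for value in (True, False):
        part = [l for l, f in zip(labels, feature) if f == value]
        if part:
            cond += (len(part) / len(labels)) * entropy(sum(part),
                                                        len(part) - sum(part))
    return base - cond

def gain_ratio(labels, feature):
    """Gain Ratio: IG normalized by the split information of the feature,
    penalizing features that produce many or very uneven splits."""
    t = sum(feature)
    split = entropy(t, len(feature) - t)
    return information_gain(labels, feature) / split if split > 0 else 0.0
```

A feature that perfectly separates dangerous from safe nodes yields IG = 1.0, while a constant feature yields 0 for both measures.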

3. Exploring Centrality Metrics in Dependency Risk Graphs

We will explore the effect of centrality metrics on the importance of nodes in dependency risk graphs. Recall that in such graphs the edges denote directed risk relations between nodes, and not topological connections. Intuitively, nodes with high centrality measures are expected to have a high effect on the overall dependency risk. The question that needs to be answered is which centrality metrics have the higher effect on risk propagation. Then, the nodes exhibiting the highest values of these centrality metrics will be good candidates for the implementation of risk mitigation controls. An algorithm capable of identifying these nodes will be able to support a cost-effective risk mitigation strategy.

3.1. Analysis of centrality metrics

We will examine each centrality metric in dependency risk graphs, in order to analyze its effect on the overall risk.

Degree centrality. A node with high degree centrality is a node with many dependencies. Since the edges in risk graphs are directional, degree centrality should be examined for two different cases:

- Inbound degree centrality, i.e. the number of edges ending at a node. A high inbound degree centrality indicates nodes known as cascading resulting nodes [15].

- Outbound degree centrality, i.e. the number of edges starting from a node. This is an indication of a cascading initiating node [15].

Nodes with high inbound degree centrality in a risk graph are natural sinkhole points of incoming dependency risk. Nodes with high outbound degree centrality seem to be suitable nodes to examine when prioritizing mitigation controls. Indeed, if proper mitigation controls are applied to such nodes, then multiple cumulative dependency risk chains will simultaneously be reduced. Such a strategy could be a much more cost-effective alternative to a mitigation strategy aiming at applying controls to particular high-risk edges or high-risk paths. Obviously, it is not certain that applying one or more security controls to a node with high outbound degree centrality will positively affect many (or all) outgoing dependency chains using this node. However, a mitigation strategy would benefit if it were to initially examine such possible security controls.

Closeness centrality. Nodes with high closeness centrality are nodes that have short average distances from most nodes in a graph. In the case of a dependency risk graph, nodes with high closeness are nodes that tend to be part of many dependency chains, and sometimes even initiate them. Since cascading effects tend to affect relatively short chains (empirical evidence indicates that cascades rarely propagate deeply [16]), intuitively, nodes with high closeness centrality will have a bigger effect on the overall risk of the dependency chains than nodes with low closeness centrality. To formalize this, consider equation (1), used to compute the cumulative risk of a dependency chain: the closer a node is to the initiator of a cascading event, the greater its effect on the cumulative dependency risk, since the likelihood of its outgoing dependency affects all the partial risk values of the following dependencies (edges).

A more effective way to exploit closeness centrality in mitigation decisions is to compute the closeness of every node with respect to the subset of the most important initiator nodes. Regardless of the underlying methodology, the risk assessors will already have a priori knowledge or intuition of the most important nodes for cascading failure scenarios. For example, empirical results show that energy and ICT nodes are the most common cascade initiators [16].
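This closeness-to-initiators idea can be sketched with a plain breadth-first search over the dependency graph (hop-count distances on an unweighted directed graph; the adjacency-list format, the averaging choice and the example nodes are our illustrative assumptions):

```python
from collections import deque

def distances_from(graph, source):
    """BFS hop distances from `source` in a directed graph given as
    {node: [successor, ...]} adjacency lists."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def closeness_to_initiators(graph, node, initiators):
    """Average hop distance from the given initiator nodes to `node`
    (smaller = closer to the likely cascade initiators). Initiators that
    cannot reach the node are ignored; the node itself is skipped."""
    ds = []
    for s in initiators:
        d = distances_from(graph, s).get(node)
        if d:                       # skips unreachable (None) and self (0)
            ds.append(d)
    return sum(ds) / len(ds) if ds else float("inf")

# Illustrative use: "E" plays the role of a known energy/ICT initiator.
graph = {"E": ["A"], "A": ["B"], "B": ["C"]}
print(closeness_to_initiators(graph, "B", ["E"]))
```

Ranking nodes by this score gives the secondary mitigation-prioritization criterion described above.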
In addition, nodes with high outgoing degree centrality will probably be nodes that participate in multiple dependency risk chains. Thus it is possible to first identify the subset of the most important nodes for cascading failures and then compute the closeness of all other nodes in relation to this subset, as a secondary criterion for mitigation prioritization.

Eccentricity centrality. Similar to closeness centrality, this is a measure of the centrality of a node in a graph which is based on having a small maximum distance from the node to every other reachable node. Here, the maximum distances are the greatest distances detected over all shortest paths between v and any other vertex (the geodesic distance). If the eccentricity centrality of a CI node is high, then all other CI nodes are in proximity.

Betweenness centrality. In a dependency risk graph, a node with high betweenness centrality will lie on a high proportion of dependency risk paths. This means that even though such nodes may not be initiating nodes of a cascading failure (high outbound centrality) and may not belong to a path with high cumulative dependency risk, they tend to contribute to multiple risk paths and thus play an important role in the overall risk. Applying mitigation measures to such nodes (in the form of security controls) will probably decrease the dependency risk of multiple chains simultaneously. Comparing closeness centrality with betweenness centrality, it seems that closeness should precede betweenness as a mitigation criterion. Although nodes that are between multiple paths will eventually affect multiple chains, it is possible that a node lies in multiple paths but tends to be at the end of the chains, practically not affecting the cumulative dependency risk (recall that nodes with high-order dependency are rarely affected).

Bonacich (Eigenvector) centrality. A node with high Bonacich [17] (eigenvector) centrality is a node that is adjacent to nodes with very high or very low influence, depending on whether β > 0 or β < 0. In a dependency risk graph, we are interested in nodes with high eigenvector centrality where β > 0, since these nodes are connected to other nodes which also have high connectivity. This is an interesting measure for CI risk graphs, as such nodes can not only cause cascading failures to more nodes, but can also cause multiple cascade chains of high risk. In contrast, a less connected node shares fewer dependencies with other nodes and is affected only by specific nodes in the graph. This means that applying mitigation measures to such a node may not significantly affect the overall risk. Conversely, if one applies mitigation controls to a node with high eigenvector centrality (when β > 0), the most powerful (or critical) node is modified and this, in turn, affects several other important nodes.

3.2. Applying feature selection to classify centrality metrics

In order to validate our intuitive analysis of centrality metrics in dependency risk graphs, described in Section 3.1, we use a data mining technique named feature selection to empirically study possible relations between the risk level of nodes in dependency paths (as assessed using the aforementioned methodology from [1]) and the various centrality metrics of nodes. This experiment will show how often nodes concurrently appear in: (a) the set of the most critical paths (i.e., the paths with the highest cumulative dependency risk value) and (b) the sets of nodes with the highest centrality values in a dependency risk graph.

For our experiments, we simulated graphs that were randomized based on restrictions proposed in recent empirical studies [16, 15], in order to resemble CI dependencies based on real data. In particular:

Occasional tight coupling. Some node relationships in the dependency risk graph are selected to have high impact values. To achieve this, in our experiments randomization applies random risk values with relatively high lower and upper likelihood and impact bounds.

Interactive complexity. In reality, dependencies between CIs have high interactive complexity.
Informally, this implies that we cannot foresee all the ways things can go wrong. In our experiments, we constructed random graphs of 50 nodes with high complexity; the critical paths up to 4th-order dependencies have possible combinations ranging from 230,300 to 2,118,760 chains.

Number of dependencies. We randomly simulated CI nodes having from 1 to 7 connections (dependencies).

Path length. Critical paths of length 3 to 4 hops are used.

Number of initiators. More than 60% of the nodes in the graph act as initiators.

Connectivity of initiators. In our experiments, the initiators tend to have higher connectivity than the other nodes.

In our experiments we used the Weka tool (version 3.6). All experiments were conducted using the available implementations of the Information Gain and Gain Ratio feature selection algorithms in Weka, using the program's Attribute Selection Ranker to rank output results.

3.3. Results of feature selection for centrality metrics

We analyzed a data set of 32,950 nodes extracted from about 700 graphs containing about 774,015,270 paths. Dependency graphs were created with the aforementioned restrictions. Feature selection algorithms were used to detect correlations between high centrality metrics and CI nodes that greatly affect the overall risk. Weka's output rankings provide insight on how high centrality features characterize dangerous nodes that affect the risk in the top critical paths. Detailed output values for each test are presented in Tables 1 and 2. Table 1 depicts the output from the Information Gain algorithm, whereas Table 2 provides the Gain Ratio output:

INFORMATION GAIN                     Inbound Test   Outbound Test
Betweenness                          ...            ...
Eccentricity                         ...            ...
Closeness                            ...            ...
Eigenvector                          ...            ...
Intersection of all Centralities     ...            ...
Inbound degree (sinkholes)           ...            ...
Outbound degree                      ...            ...

Table 1: Weka's output ranking using the Information Gain algorithm

GAIN RATIO                           Inbound Test   Outbound Test
Betweenness                          ...            ...
Eccentricity                         ...            ...
Closeness                            ...            ...
Eigenvector                          ...            ...
Intersection of all Centralities     ...            ...
Inbound degree (sinkholes)           ...            ...
Outbound degree                      ...            ...

Table 2: Weka's output ranking using the Gain Ratio algorithm

Tests are divided into two sets: Inbound tests and Outbound tests. The difference between the two is that Inbound tests use Inbound Degree Centrality measures, whereas Outbound tests compute everything using Outbound Degree Centrality measures. Results from Weka provided valuable information as to how nodes with high centrality values affect dependency risk graphs:

1. The number of critical paths that had at least one node with high centrality measurements is very high: an average of 74.4% of the most critical paths in all graphs contained a node with at least one high centrality metric. This percentage increases dramatically to 99.9% for the top 10% of most critical paths, which leads to the conclusion that the top 10% of paths frequently pass through the same nodes as the top 1% of paths. If we analyze even larger sample sets (more than 10% of critical paths), almost all nodes with high centrality are part of some critical path.

2. Gain Ratio showed that CI nodes with high centrality measures in all types of centralities rarely appear in graphs but, when they do appear, they definitely affect the cumulative risk of critical paths. CI nodes that have high measures in all centrality metrics are of high importance.

3. Nodes with high Closeness centrality seem to be the best candidates for detecting important CI nodes. Yet, results clearly show that all centrality types seem to be useful indicators of important CI nodes (Information Gain is above 20% for all features). Thus, all centrality metrics must be taken into account in each graph.

4. Nodes with high Closeness and/or Degree centrality, both outbound and inbound (sinkholes), tend to mitigate high amounts of risk. An average of 49.6% of all dangerous nodes tested had a high Closeness centrality value and an average of 45% had a high Degree centrality value.

5. Risk mitigation is highly dependent on the graph itself. For example, in graphs that contain nodes acting like single points of failure, high centrality values are excellent indicators of important CI nodes. If, however, we examine graphs in which, for example, all nodes have equal degree, the identification of nodes for risk mitigation is often less successful. Luckily, CI dependency nodes tend to have specific structural characteristics [16, 15].

4. An Algorithm for an Efficient Risk Mitigation Strategy

We have concluded from the previous experiment that, even though nodes with high centrality are only a fraction of the nodes in the most critical risk paths, they greatly affect all critical paths. Also, feature analysis results suggest that all centrality measures can play an important role, depending on the graph analyzed. Thus, it seems essential for a risk mitigation strategy not only to take into consideration which centrality measures provide the best results (like high Closeness measures), but also to consider the particularities of each graph and adapt accordingly, by choosing the correct combination of centrality metrics to pinpoint nodes for applying risk mitigation controls.

Algorithm 1: Decision-tree traversal using Information Gain IG_x(N) to select nodes for Risk Mitigation

Data:   High-centrality subsets: C = {C_C, C_B, C_I, C_Eg, C_E}
        Head set of all nodes: N
        Subset of dangerous nodes: D ⊆ N
        Information Gain of centrality subset c over set D: IG_c(D)
Result: Set of selected nodes for risk mitigation S ⊆ N

1. Start: Mark all sets in C as unused; position at the root node.
2. Leaf:
     if C is empty then output remaining nodes in D; END;
     else if IG_c(D) = 0 for all c ∈ C then output remaining nodes in D; END;
     else go to Step 5.
3. Decide case:
     if there are no unused sets in C then go to Step 4;
     else go to Step 5.
4. Backtrack:
     if at the root then STOP;
     else return to the vertex just above this one and go to Step 3.
5. Decision:
     if IG_c1(D) = IG_c2(D) for all c1, c2 ∈ C then select the left-most c from set C;
     else select the c from set C with the highest IG_c(D);
     N ← N ∩ c; D ← D ∩ c; remove the selected c from C; go to Step 2.

We incorporate the results of the previous feature selection analysis (Section 3.2), in order to design a node selection algorithm for efficient risk mitigation. The proposed algorithm is heuristic and recursively chooses the best set of nodes, by limiting its selection options, until it reaches a set of selected nodes of a predefined size.

Modeling feature selection results into algorithmic restrictions

Recall from Section 3.2 that the most important conclusions are: (i) nodes with high centrality measures in all types of centrality metrics always affect the total risk of a graph; (ii) nodes with high Closeness and high Betweenness centrality seem to have a greater impact than the other metrics; (iii) if a node concurrently has high centrality values for more centrality metrics, it tends to have a higher effect on the risk graph. To model these observations into a dynamic node selection algorithm for risk mitigation, we implemented the following steps:

1. Model definition: Define the subsets of nodes which exhibit the highest centrality values for each centrality metric, as well as their Information Gain over the entire node set.

2. Setup: Pre-compute all centrality measures for each node in a graph. Using these centrality measures, generate all high-centrality subsets. These subsets will act as features for feature selection. In addition, pre-compute all dependency risk paths of the graph and sort them according to their weight (the total cumulative risk value of each path).

3. Use decision trees and Information Gain for node selection: Define the Information Gain method and its use in splitting node sets.

4. Traverse a decision tree using Information Gain splits: Traverse a decision tree to select the optimal subset of nodes for applying risk mitigation controls.

Model Definition

Let C_A denote the subset of the top X% of nodes with the highest centrality measurements from all centrality sets; C_E denote the subset of the top X% of nodes with the highest Eccentricity; C_C denote the subset of the top X% of nodes with the highest Closeness centrality; C_I denote the subset of the top X% of nodes with the highest Inbound Degree centrality (sinkholes); C_B denote the subset of the top X% of nodes with the highest Betweenness centrality; and finally C_Eg denote the subset of the top X% of nodes with the highest Eigenvector centrality. Let N denote the set of all nodes in the risk graph. Let D ⊆ N denote the subset of nodes that participate in the top 20 most critical risk paths of the graph. Finally, let IG_x(N) denote the information gain when choosing subset C_x as a feature of the node set N. For example, IG_B(N) denotes the information gain when using high Betweenness centrality as a feature of set N. Information Gain values are calculated for each centrality feature subset separately. The high-centrality subset C_x with the highest Information Gain over N is considered to affect the risk graph the most, in comparison with the other high-centrality subsets.

Example: In a Dependency Risk Graph, there exist nodes that act as initiators of cascading events in the three most critical paths (i.e., the critical paths with the highest cumulative dependency risk value). Yet there also exist nodes that, even though they have a smaller influence on the top three most critical paths, affect the overall risk of the entire Dependency Risk Graph more than the aforementioned initiators. Our risk mitigation algorithm will be able to suggest these nodes over the three initiator nodes, using the Information Gain output from all high-centrality node subsets.

Setup

The centrality measures of each node are pre-computed, along with all the dependency risk paths of the graph. For each centrality measure, the algorithm uses the pre-computed centrality values and creates the subsets C_E, C_C, C_I, C_B and C_Eg. Then, the algorithm computes all risk paths in the graph and sorts them according to their total accumulated risk. All output is used as input to the risk mitigation algorithm, in order to choose the most effective nodes for risk mitigation using countermeasures.
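The setup step, i.e. extracting the top X% of nodes per centrality metric, can be sketched as follows (a hypothetical helper; the score tables and the 25% cut-off are illustrative):

```python
def top_percent(scores, x):
    """Subset of nodes in the top x% by a centrality score
    (ties broken by node id, so the result is deterministic)."""
    k = max(1, round(len(scores) * x / 100))
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return set(ranked[:k])

# Illustrative per-metric score tables (node -> centrality value):
closeness   = {"A": 0.9, "B": 0.7, "C": 0.3, "D": 0.1}
betweenness = {"A": 12.0, "B": 3.0, "C": 9.0, "D": 0.0}

C_C = top_percent(closeness, 25)    # high-Closeness subset
C_B = top_percent(betweenness, 25)  # high-Betweenness subset
```

Each such subset then acts as one boolean feature ("node is in the top X% for this metric") for the Information Gain computation of Algorithm 1.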

Using decision trees and Information Gain for node selection

A decision can be made (and, practically, a decision tree can be traversed) by creating a split over the head set N of CI nodes, producing smaller subsets (the tree leaves). In a Dependency Risk Graph, the algorithm chooses the specific centrality node subset with the highest impact on the graph. Then, it outputs a new subset which is the intersection of the chosen subset with the original set. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is complete when the Information Gain calculation at a specific point in the decision tree produces the same Information Gain values for the target subset of nodes, or when splitting no longer adds information to the predictions.

Algorithm 1 is used to define a sequence of high centrality subsets that act as features. The risk mitigation algorithm investigates this sequence to rapidly narrow down the group of CI nodes with the highest impact on the Dependency Risk Graph. This feature selection increases both effectiveness and efficiency, since it removes non-informative CI nodes according to corpus statistics [10].

Example: Traversing a decision tree using Information Gain splits

The above algorithm aims to choose the nodes with the highest impact on a Dependency Risk Graph. The risk mitigation strategy can be summarized in a single sentence: always choose the subset of nodes with the highest Information Gain at each step of the decision tree. Let us assume a Dependency Risk Graph with all centrality measures calculated and all high centrality subsets defined. Algorithm 1 would then run as follows:

1. Step 1: We need to find which centrality feature produces the most Information Gain against the root set D ⊆ N of nodes.
The Information Gain is calculated for all five high centrality subsets (features):

IG_C(D) = … // Gain of the Closeness subset over D
IG_B(D) = … // Gain of the Betweenness subset over D

IG_S(D) = … // Gain of the Sinkholes subset over D
IG_Eg(D) = … // Gain of the Eigenvector subset over D
IG_E(D) = … // Gain of the Eccentricity subset over D

The Closeness high centrality subset provides the highest Information Gain; therefore, it is used as the decision attribute at the root set D. From here on, D is the subset D ∩ C_C of nodes and N becomes the subset N ∩ C_C.

2. Step 2: The next question is which feature should be tested at the second branch. Since we have used C_C at the root, we only decide among the remaining four centrality subsets: Betweenness, Sinkholes, Eigenvector and Eccentricity.

IG_B(D) = …
IG_Eg(D) = …
IG_S(D) = …
IG_E(D) = …

In this step, the Betweenness subset shows the highest Information Gain; therefore, it is used as the decision node. From here on, D is the subset D ∩ C_B of nodes and N becomes the subset N ∩ C_B.

3. Steps 3 to N: This process goes on until all nodes are classified and D is left with three nodes, or until the algorithm runs out of centrality attributes.

5. Validation of the proposed algorithm

We empirically study the efficiency of the risk mitigation algorithm presented in Section 4. We ran 2000 random experiments; in each experiment the graph was produced based on the limitations described in Section 3.2, and each graph contained up to 50 nodes.
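The Information Gain splits described in the walkthrough above can be sketched as a short, self-contained example (our own simplification, not the paper's implementation: the target label is taken to be membership in the critical-path set D, and the node sets are invented):

```python
from math import log2

def entropy(nodes, critical):
    """Binary entropy of the label 'node participates in the top critical paths'."""
    if not nodes:
        return 0.0
    p = sum(1 for n in nodes if n in critical) / len(nodes)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(nodes, feature_subset, critical):
    """IG of splitting `nodes` on membership in a high centrality subset."""
    inside = nodes & feature_subset
    outside = nodes - feature_subset
    split = (len(inside) / len(nodes)) * entropy(inside, critical) \
          + (len(outside) / len(nodes)) * entropy(outside, critical)
    return entropy(nodes, critical) - split

def best_feature(nodes, features, critical):
    """Pick the high centrality subset (feature) with the highest Information Gain."""
    return max(features, key=lambda name: information_gain(nodes, features[name], critical))

N = set("ABCDEFGH")
D = {"A", "B", "C"}                          # nodes in the top critical paths
features = {"Closeness":   {"A", "B", "C", "D"},
            "Betweenness": {"A", "E", "F"}}
print(best_feature(N, features, D))          # Closeness
```

Here the Closeness subset almost perfectly separates critical-path nodes from the rest, so it wins the root split; the procedure then recurses on N ∩ C_Closeness, as in Step 2 above.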

5.1. Efficiency analysis of the proposed mitigation strategy

We know from previous results that centrality measures can indeed characterize important CI nodes that greatly affect the overall risk in a dependency risk graph. To evaluate the effectiveness of the proposed risk mitigation strategy, we need to emulate the implementation of mitigation controls in the nodes identified as important (dangerous) by the algorithm presented in Section 4. In practice, examples of controls can be the repair prioritization of nodes (which node to send a repair crew to first), the increase of asset redundancy on a node, or any other control that may reduce the likelihood or the consequence of a failure.

In our experiments, we emulate the implementation of mitigation controls on a node CI_i by reducing the values L_i,j that define the likelihood that a failure experienced at node CI_i cascades to another node CI_j that depends on it. All likelihood values L_i,j of all the dependencies that originate from CI_i were reduced by 20%. To measure the result of applying risk mitigation controls on each selected subset of nodes, we calculated the dependency risk values in the same graph before and after the implementation of risk mitigation, and we calculated the risk reduction achieved in each case.

We measured the efficiency of the proposed risk mitigation algorithm by calculating the percentage of risk reduction in graphs using the aforementioned emulation of mitigation controls. The effect on three risk metrics was examined using the proposed Algorithm 1 from Section 4: (i) the risk reduction achieved in the most critical path; (ii) the risk reduction in the total sum of the risk of the top 20 paths with the highest cumulative dependency risk; and (iii) the risk reduction in the total risk from all graph paths.
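The emulation described above can be sketched as follows. The 20% reduction comes from the text; the dependency-risk formula used here (the product of the likelihoods along the chain, multiplied by the impact at the final node) is a simplified stand-in for the paper's metric, so the numbers are purely illustrative:

```python
def apply_controls(likelihood, node, reduction=0.20):
    """Emulate mitigation controls on `node`: reduce every outgoing
    dependency likelihood L[node][j] by the given fraction."""
    mitigated = {i: dict(out) for i, out in likelihood.items()}
    for j in mitigated.get(node, {}):
        mitigated[node][j] *= (1.0 - reduction)
    return mitigated

def path_risk(path, likelihood, impact):
    """Illustrative dependency risk of a cascade path: product of the
    edge likelihoods along the chain times the impact at the last node."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= likelihood[a][b]
    return p * impact[path[-1]]

# Tiny example graph: A -> B -> C, with per-node impact values.
L = {"A": {"B": 0.5}, "B": {"C": 0.8}, "C": {}}
I = {"A": 1.0, "B": 2.0, "C": 10.0}

before = path_risk(["A", "B", "C"], L, I)          # 0.5 * 0.8 * 10 = 4.0
after = path_risk(["A", "B", "C"], apply_controls(L, "A"), I)
print(round(100 * (before - after) / before, 1))   # percentage risk reduction: 20.0
```

Under this multiplicative formula, mitigating one node on a path reduces that path's risk by exactly the likelihood reduction applied; paths that do not traverse the mitigated node are unaffected, which is why node selection matters.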
Only acyclic paths were taken into consideration, to reduce overlapping: for example, if a high risk path were cyclic, its acyclic part would most likely appear as a high risk path too. Thus, we avoided calculating cyclic paths in our experiments. Mitigation controls were implemented on three (3) out of the fifty (50) CI nodes in each experiment (6% of the nodes in the entire graph).
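Path enumeration under the acyclic restriction can be sketched like this (a self-contained illustration of ours; the ranking weight uses a simplified multiplicative risk formula, product of likelihoods times final impact, as an assumed stand-in for the paper's exact metric):

```python
def acyclic_paths(likelihood, start):
    """Enumerate all simple (acyclic) dependency paths starting from `start`."""
    paths = []
    def walk(path):
        if len(path) > 1:
            paths.append(path)
        for nxt in likelihood.get(path[-1], {}):
            if nxt not in path:        # the acyclic restriction: never revisit a node
                walk(path + [nxt])
    walk([start])
    return paths

def top_k_paths(likelihood, impact, k=20):
    """Rank all acyclic paths by their (illustrative) cumulative risk."""
    def risk(path):
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= likelihood[a][b]
        return p * impact[path[-1]]
    all_paths = [p for n in likelihood for p in acyclic_paths(likelihood, n)]
    return sorted(all_paths, key=risk, reverse=True)[:k]

# Tiny graph with a cycle A -> B -> A; the cycle is never followed.
L = {"A": {"B": 0.9}, "B": {"C": 0.5, "A": 0.3}, "C": {}}
I = {"A": 1.0, "B": 4.0, "C": 10.0}
print(top_k_paths(L, I, k=3))   # [['B', 'C'], ['A', 'B', 'C'], ['A', 'B']]
```

Note that enumerating all simple paths is exponential in the worst case, which is consistent with the paper's choice to pre-compute and sort the paths once during the setup step.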

Figure 1: Risk reduction in the most critical path

Effect of the algorithm on the most critical path

Using the proposed algorithm from Section 4 to select CI nodes for implementing mitigation controls, we achieved an average risk reduction of 12.1%. The highest risk reduction achieved in all experiments was 43.7%; almost half the risk of the most critical path was mitigated. This was the highest risk mitigation achieved among all the different mitigation strategies tested. A more thorough comparison of the results can be found in Section 5.2.

As shown in Figure 1, in about 25% of the tests (498 out of 2000) there was no risk reduction in the most critical path. This, however, is not a primary concern, since our goal is to reduce the overall risk of the dependency risk graph, and not necessarily the risk of the most critical path, whose nodes are easy to identify. We note, however, that when mitigation controls were applied on nodes belonging to the most critical path, the algorithm achieved better risk reduction results than alternative well-known mitigation strategies, such as the implementation of mitigation controls on the initiator of the most critical path.

Figure 2: Risk reduction in the top 20 most critical paths

Effect of the algorithm on the top 20 risk paths

In Figure 2, we see that the highest risk reduction achieved in the sum of all the risk values derived from the top 20 critical paths was 37.5%. The average risk reduction achieved using the proposed algorithm was 9.8%. In addition, we see that, out of the 2000 experiments, only 12 graphs had a structure that prohibited our algorithm from achieving any risk reduction in the top 20 paths; 99.4% of the algorithm's applications achieved some risk mitigation, and 47.3% of the applications achieved a high risk reduction (10% or more).

These are interesting results: the top 20 most critical paths can comprise up to 1000 nodes in each experiment. Even though paths share CI nodes, choosing only 3 out of 50 nodes for risk mitigation across 20 paths is far more complex than choosing the correct nodes only in the most critical path, which comprises 5 nodes. Again, the proposed algorithm had the best performance of the three different mitigation strategies tested.

In addition, by monitoring the algorithm's execution and CI node choices, we validated some preliminary results from the feature analysis: the risk mitigation algorithm was mostly choosing CI nodes with high centrality measures in all

centrality metrics (i.e., the intersection of all high centrality sets for all centrality metrics). Besides aggregating all centrality metrics, the second most frequent combination of centrality types was nodes with both high Degree centrality (e.g., sinkholes) and high Closeness centrality.

Effect of the algorithm on the entire dependency risk graph

Figure 3: Risk reduction in the complete set of paths

Last but not least, we analyzed the efficiency of the proposed algorithm when trying to reduce the risk of the entire risk graph, not only of the top 20 paths. Here, we tested risk mitigation using the proposed node selection algorithm on the sum of all paths in the dependency graphs. The results from the 2000 experiments are shown in Figure 3. We see that the highest risk reduction achieved was 12.2%, with an average risk reduction of 7.5% across all experiments. Interestingly, in 86.4% of the tests the risk reduction achieved was between 5% and 10%, with an average of around 8.7%. With such a low dispersion of results, the proposed algorithm's output shows strong signs of uniformity and homogeneity. Again, the proposed algorithm had the best performance of the three different mitigation strategies tested.

5.2. Comparing results with alternative mitigation strategies

After presenting the results of the proposed algorithm, we now compare it with alternative mitigation mechanisms to evaluate its performance. Two alternative mechanisms for choosing CI nodes were used, both well-known and frequently proposed scenarios.

Using initiators from the most critical paths

The first strategy is based on the idea that initiators are critical points of failure in dependency risk chains. This alternative mitigation strategy implements mitigation controls on the CI nodes that are the initiators of critical dependency paths. Thus, in our experiments, three (3) out of the fifty (50) CI nodes are chosen (6% of the nodes in the entire graph), namely those that act as initiators of the corresponding top 3 most critical paths.

Using nodes with high Degree centrality (sinkholes and outbound)

The second strategy reflects the common idea amongst auditors that CI nodes with high inbound and outbound Degree centrality (i.e., nodes with a high number of incoming and/or outgoing connections) strongly affect dependency risk paths. For this reason, the second mitigation strategy chooses the top 6% of nodes with the highest inbound and outbound Degree centrality measures to apply mitigation controls.

Comparison results

Strategy          | Most critical path       | Top 20 critical paths    | Entire graph
Information Gain  | 43.7% (max), 12.1% (avg) | 37.5% (max), 9.8% (avg)  | 12.2% (max), 7.5% (avg)
Top Initiators    | 38.4% (max), 11.8% (avg) | 28.7% (max), 10.0% (avg) | 10.1% (max), 5.3% (avg)
Top Sinkholes     | 34.5% (max), 10.3% (avg) | 29.8% (max), 7.3% (avg)  | 10.8% (max), 6.7% (avg)

Table 3: Comparison of the results from all mitigation strategies

Table 3 shows the results from all mitigation strategies. Clearly, in our tests, the proposed decision-making algorithm using Information Gain had a clear advantage over the rest as far as pure numbers are concerned. Only the top initiators mechanism scored a higher average in reducing risk at the top 20 most critical paths, albeit only marginally.

5.3. Implementation details

We briefly describe the technical aspects of the executed tests: platforms, technologies and algorithm implementations. All tests were performed on an Intel Core i7 with 4 cores and 16 GB of RAM. In order to test the proposed risk mitigation strategies, we developed an automatic Dependency Risk Graph generator in Java. The application randomly creates risk graphs and then runs the mitigation algorithms on the generated models. Finally, it reports the percentage of risk reduction achieved by each of the mitigation mechanisms.

We selected Neo4J [18] as the technology for our dependency analysis modeling tool, due to its adaptability, scalability and efficiency [19, 20]. Neo4J builds upon the property graph model: nodes may have various labels, and each label can serve as an informational entity. Nodes are connected via directed, typed relationships, and both nodes and relationships can hold arbitrary properties (key-value pairs). Using Neo4J, the developed framework can represent complex graphs of thousands of dependent CIs through a weighted, directed graph.

The developed application is stand-alone: it builds a random Dependency Risk Graph using the aforementioned restrictions to imitate real-world CI dependencies. It then pre-calculates all necessary graph data and executes the selected risk mitigation algorithm. When finished, it outputs the percentage of risk reduction achieved in the current experiment/graph.

6. Conclusions

In this paper, we extended our previous method on dependency risk graph analysis, using graph centrality measures as additional criteria for the evaluation

of alternative risk mitigation strategies. We used feature selection algorithms and confirmed that the most critical paths in dependency risk graphs tend to involve nodes with high centrality measures. Since various centrality metrics exist, we performed multiple tests in order to define which measures and combinations are the most effective. Our results showed that aggregating all centrality sets to identify nodes with high overall centrality provides an optimal mitigation strategy. However, by comparing the different centrality metrics, we showed that nodes with high Closeness and high Degree centrality measures seem to have the highest impact on the overall risk of a graph.

Still, simply choosing nodes from these sets is not always a viable choice, as there exist dependency graphs where no nodes appear in specific high centrality sets, or there may be contextual reasons that inhibit the application of controls on these nodes. For this reason, we developed a risk mitigation algorithm that uses a decision tree built from subsets of nodes with high centrality measures for each centrality metric. The Information Gain algorithm is used to traverse the tree and choose the subsets of high centrality nodes that best characterize each graph under test. Thus, the algorithm is able to dynamically pinpoint the optimal subset of nodes with high centrality features for overall risk mitigation.

6.1. Advantages and limitations of the proposed mitigation method

Although the proposed mitigation algorithm achieved the highest maximum risk mitigation when compared with well-known mitigation strategies, our experiments showed that there are cases where the algorithm will not select nodes belonging to the most critical path. This can be considered a potential limitation and, at the same time, a desired characteristic. On the one hand, the proposed algorithm will not always reduce the risk of the most critical path.
Note, however, that if this is required, it can be trivially achieved by selecting the initiator of the most critical path. On the other hand, the proposed algorithm outperforms its competitors because it searches for nodes that have a great effect on the overall risk, and not merely on a subset of critical paths. This is the main advantage of the proposed algorithm; it can be used to identify CI nodes

which may not appear in the top critical path(s) but greatly affect the overall risk of a dependency graph. While the exhaustive computation of the complete set of dependency risk paths may provide useful information and reveal underestimated CI nodes, assessors may also use the algorithm to examine particularly complex scenarios where manually selecting nodes for implementing mitigation controls might be very time consuming and/or impractical.

A limitation of this approach is the need for input data from risk assessments performed at each of the CIs. This is an inherent problem of all empirical risk approaches, since they analyze dependencies based on previous incidents. Sector coordinators or regulators may collect data concerning a specific sector (such as ICT or energy), while national CIP authorities or CERTs may be able to collect such information.

References

[1] P. Kotzanikolaou, M. Theoharidou, D. Gritzalis, Cascading effects of common-cause failures in critical infrastructures, in: J. Butts, S. Shenoi (Eds.), Critical Infrastructure Protection, Vol. 417 of IFIP Advances in Information and Communication Technology, Springer, 2013.

[2] P. Kotzanikolaou, M. Theoharidou, D. Gritzalis, Assessing n-order dependencies between critical infrastructures, IJCIS 9 (1/2) (2013).

[3] P. Kotzanikolaou, M. Theoharidou, D. Gritzalis, Interdependencies between critical infrastructures: Analyzing the risk of cascading effects, in: Critical Information Infrastructure Security, Springer, 2013.

[4] M. Theoharidou, P. Kotzanikolaou, D. Gritzalis, Risk assessment methodology for interdependent critical infrastructures, International Journal of Risk Assessment and Management 15 (2) (2011).

[5] M. Theoharidou, P. Kotzanikolaou, D. Gritzalis, A multi-layer criticality assessment methodology based on interdependencies, Computers & Security 29 (6) (2010).


More information

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

More information

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1 System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect

More information

Static IP Routing and Aggregation Exercises

Static IP Routing and Aggregation Exercises Politecnico di Torino Static IP Routing and Aggregation xercises Fulvio Risso August 0, 0 Contents I. Methodology 4. Static routing and routes aggregation 5.. Main concepts........................................

More information

Fault Analysis in Software with the Data Interaction of Classes

Fault Analysis in Software with the Data Interaction of Classes , pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

Social Media Mining. Graph Essentials

Social Media Mining. Graph Essentials Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Scheduling Shop Scheduling. Tim Nieberg

Scheduling Shop Scheduling. Tim Nieberg Scheduling Shop Scheduling Tim Nieberg Shop models: General Introduction Remark: Consider non preemptive problems with regular objectives Notation Shop Problems: m machines, n jobs 1,..., n operations

More information

Fault Localization in a Software Project using Back- Tracking Principles of Matrix Dependency

Fault Localization in a Software Project using Back- Tracking Principles of Matrix Dependency Fault Localization in a Software Project using Back- Tracking Principles of Matrix Dependency ABSTRACT Fault identification and testing has always been the most specific concern in the field of software

More information

A Brief Introduction to Property Testing

A Brief Introduction to Property Testing A Brief Introduction to Property Testing Oded Goldreich Abstract. This short article provides a brief description of the main issues that underly the study of property testing. It is meant to serve as

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

Load balancing Static Load Balancing

Load balancing Static Load Balancing Chapter 7 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

How To Find Local Affinity Patterns In Big Data

How To Find Local Affinity Patterns In Big Data Detection of local affinity patterns in big data Andrea Marinoni, Paolo Gamba Department of Electronics, University of Pavia, Italy Abstract Mining information in Big Data requires to design a new class

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Load Balancing and Termination Detection

Load Balancing and Termination Detection Chapter 7 Load Balancing and Termination Detection 1 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

More information

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application 2012 International Conference on Information and Computer Applications (ICICA 2012) IPCSIT vol. 24 (2012) (2012) IACSIT Press, Singapore A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs

More information

Baseline Code Analysis Using McCabe IQ

Baseline Code Analysis Using McCabe IQ White Paper Table of Contents What is Baseline Code Analysis?.....2 Importance of Baseline Code Analysis...2 The Objectives of Baseline Code Analysis...4 Best Practices for Baseline Code Analysis...4 Challenges

More information

root node level: internal node edge leaf node CS@VT Data Structures & Algorithms 2000-2009 McQuain

root node level: internal node edge leaf node CS@VT Data Structures & Algorithms 2000-2009 McQuain inary Trees 1 A binary tree is either empty, or it consists of a node called the root together with two binary trees called the left subtree and the right subtree of the root, which are disjoint from each

More information

A Decision Theoretic Approach to Targeted Advertising

A Decision Theoretic Approach to Targeted Advertising 82 UNCERTAINTY IN ARTIFICIAL INTELLIGENCE PROCEEDINGS 2000 A Decision Theoretic Approach to Targeted Advertising David Maxwell Chickering and David Heckerman Microsoft Research Redmond WA, 98052-6399 dmax@microsoft.com

More information

A Non-Linear Schema Theorem for Genetic Algorithms

A Non-Linear Schema Theorem for Genetic Algorithms A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland

More information

Schedule Risk Analysis Simulator using Beta Distribution

Schedule Risk Analysis Simulator using Beta Distribution Schedule Risk Analysis Simulator using Beta Distribution Isha Sharma Department of Computer Science and Applications, Kurukshetra University, Kurukshetra, Haryana (INDIA) ishasharma211@yahoo.com Dr. P.K.

More information

Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Presented by: Payman Khani

Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Presented by: Payman Khani Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Presented by: Payman Khani Overview:

More information

Simulating a File-Sharing P2P Network

Simulating a File-Sharing P2P Network Simulating a File-Sharing P2P Network Mario T. Schlosser, Tyson E. Condie, and Sepandar D. Kamvar Department of Computer Science Stanford University, Stanford, CA 94305, USA Abstract. Assessing the performance

More information

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

NEW VERSION OF DECISION SUPPORT SYSTEM FOR EVALUATING TAKEOVER BIDS IN PRIVATIZATION OF THE PUBLIC ENTERPRISES AND SERVICES

NEW VERSION OF DECISION SUPPORT SYSTEM FOR EVALUATING TAKEOVER BIDS IN PRIVATIZATION OF THE PUBLIC ENTERPRISES AND SERVICES NEW VERSION OF DECISION SUPPORT SYSTEM FOR EVALUATING TAKEOVER BIDS IN PRIVATIZATION OF THE PUBLIC ENTERPRISES AND SERVICES Silvija Vlah Kristina Soric Visnja Vojvodic Rosenzweig Department of Mathematics

More information

Efficient Data Structures for Decision Diagrams

Efficient Data Structures for Decision Diagrams Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

Network Performance Monitoring at Small Time Scales

Network Performance Monitoring at Small Time Scales Network Performance Monitoring at Small Time Scales Konstantina Papagiannaki, Rene Cruz, Christophe Diot Sprint ATL Burlingame, CA dina@sprintlabs.com Electrical and Computer Engineering Department University

More information

Methods for Firewall Policy Detection and Prevention

Methods for Firewall Policy Detection and Prevention Methods for Firewall Policy Detection and Prevention Hemkumar D Asst Professor Dept. of Computer science and Engineering Sharda University, Greater Noida NCR Mohit Chugh B.tech (Information Technology)

More information

Laboratory work in AI: First steps in Poker Playing Agents and Opponent Modeling

Laboratory work in AI: First steps in Poker Playing Agents and Opponent Modeling Laboratory work in AI: First steps in Poker Playing Agents and Opponent Modeling Avram Golbert 01574669 agolbert@gmail.com Abstract: While Artificial Intelligence research has shown great success in deterministic

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Load Balancing. Load Balancing 1 / 24

Load Balancing. Load Balancing 1 / 24 Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

More information

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching Ping Liu Jianlong Tan, Yanbing Liu Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing,

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Mining Social-Network Graphs

Mining Social-Network Graphs 342 Chapter 10 Mining Social-Network Graphs There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is

More information

2004 Networks UK Publishers. Reprinted with permission.

2004 Networks UK Publishers. Reprinted with permission. Riikka Susitaival and Samuli Aalto. Adaptive load balancing with OSPF. In Proceedings of the Second International Working Conference on Performance Modelling and Evaluation of Heterogeneous Networks (HET

More information

Simulation of Heuristic Usage for Load Balancing In Routing Efficiency

Simulation of Heuristic Usage for Load Balancing In Routing Efficiency Simulation of Heuristic Usage for Load Balancing In Routing Efficiency Nor Musliza Mustafa Fakulti Sains dan Teknologi Maklumat, Kolej Universiti Islam Antarabangsa Selangor normusliza@kuis.edu.my Abstract.

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

Topology-based network security

Topology-based network security Topology-based network security Tiit Pikma Supervised by Vitaly Skachek Research Seminar in Cryptography University of Tartu, Spring 2013 1 Introduction In both wired and wireless networks, there is the

More information

Less naive Bayes spam detection

Less naive Bayes spam detection Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P) Distributed Computing over Communication Networks: Topology (with an excursion to P2P) Some administrative comments... There will be a Skript for this part of the lecture. (Same as slides, except for today...

More information

Optimal Binary Search Trees Meet Object Oriented Programming

Optimal Binary Search Trees Meet Object Oriented Programming Optimal Binary Search Trees Meet Object Oriented Programming Stuart Hansen and Lester I. McCann Computer Science Department University of Wisconsin Parkside Kenosha, WI 53141 {hansen,mccann}@cs.uwp.edu

More information

Predicting Influentials in Online Social Networks

Predicting Influentials in Online Social Networks Predicting Influentials in Online Social Networks Rumi Ghosh Kristina Lerman USC Information Sciences Institute WHO is IMPORTANT? Characteristics Topology Dynamic Processes /Nature of flow What are the

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information