UNIVERSITY OF LYON
DOCTORAL SCHOOL OF COMPUTER SCIENCES AND MATHEMATICS

PHD THESIS
Specialty: Computer Science

Author: Sérgio Rodrigues de Morais, November 16, 2009

Bayesian Network Structure Learning with Applications in Feature Selection

Jury:
Reviewers:
   Pr. Philippe Leray - University of Nantes
   Pr. Florence d'Alché-Buc - University of Evry-Val d'Essonne
Examiners:
   PhD. David Garcia - Pôle Européen de Plasturgie
   PhD. Emmanuel Mazer - GRAVIR Laboratory
Thesis Advisors:
   Pr. Alexandre Aussem - UCBL, University of Lyon
   Pr. Joël Favrel - INSA, University of Lyon

I would like to dedicate this thesis to my loving parents, my brother and my sister...

Acknowledgements

I will forever be thankful to my PhD advisor, Professor Alexandre Aussem. His scientific advice and insightful discussions were essential for this work. Alexandre has been supportive and has given me the freedom to pursue my own proposals without objection. Most important of all, he has believed in me and given me opportunities that nobody else would have... Thanks Alex!

I also thank my collaborators, especially David Garcia and Philippe Le Bot. Their enthusiasm and professionalism are contagious. Their questions and advice were of great significance for my work. Thank you for your patience and help.

I am also very grateful to Joël Favrel, who was one of my two PhD advisors. He was a role model as a scientist, mentor, and teacher. Thank you for your advice and kindness.

I would like to acknowledge the Ligue contre le Cancer, Comité du Rhône, France, which supported the work of chapter 6. The dataset used in this chapter was kindly supplied by the International Agency for Research on Cancer (Lyon - France). I would also like to acknowledge Sophie Rome and the Institut des Sciences Complexes (Lyon - France), who supported and helped in the work presented in chapter 7. Finally, I would like to acknowledge André Tchernof and the Centre de recherche en endocrinologie moléculaire et oncologique et génomique humaine (Québec - Canada), who supported and gave great assistance to the work presented in chapter 8.

Abstract

The study developed in this thesis focuses on constraint-based methods for identifying the Bayesian network structure from data. Novel algorithms and approaches are proposed with the aim of improving Bayesian network structure learning, with applications to feature subset selection, probabilistic classification in the presence of missing values, and detection of the mechanism of missing data. Extensive empirical experiments were carried out on synthetic and real-world datasets in order to compare the methods proposed in this thesis with other state-of-the-art methods. The applications presented include extracting the relevant risk factors that are statistically associated with nasopharyngeal carcinoma, a robust analysis of type 2 diabetes from a dataset consisting of 22,283 genes and only 143 samples, and a graphical representation of the statistical dependencies between 34 clinical variables among 150 obese women with various degrees of obesity, in order to better understand the pathophysiology of visceral obesity and provide guidance for its clinical management.

Keywords: Bayesian networks, feature subset selection, missing data mechanism, classification, pattern recognition.

Contents

1 Introduction
   An Overview
   Author's Contributions
   Applications
   Outline

2 Background about Bayesian networks and structure learning
   Introduction
   Some principles of Bayesian networks
      Markov condition and d-separation
      Markov equivalence
      Embedded Faithfulness
      Markov blankets and boundaries
   Constraint-based structure learning
      Soundness of constraint-based algorithms
      G likelihood-ratio conditional independence test
      Fisher's Z test
      Existence of a perfect map
   Conditional independence models
      Graphical independence models
      Algebraic independence models
      Graph-Isomorph

I STRUCTURE LEARNING

3 Local Bayesian network structure search
   Introduction
   Preliminaries
   Conditional Independence Test
   Pitfalls and related work
   HPC: the Hybrid Parents and Children algorithm
   HPC correctness under faithfulness condition
   Experimental validation
      Accuracy
      Scalability
   MBOR: an extension of HPC for feature selection
   Discussion and conclusions

4 Conservative feature selection with missing data
   Introduction
   Preliminaries
   Dealing with missing values
      Deletion process
      A conservative Markov blanket
      A conservative independence test
      Extension to conditional G-tests
   Experimental evaluation
      Limits of the conservative test
      Ramoni and Sebastiani's benchmark
      Procedure used to remove data
      Results of the empirical experiments
   Discussion and Conclusions

5 Exploiting data missingness through Bayesian network modeling
   Introduction
   Related work
   Detecting the missing data mechanism
   Including the missing mechanism to classification models
   Empirical experiments
      Czech car factory dataset
      Congressional voting dataset
   Discussion and conclusions

II APPLICATIONS

6 Analysis of nasopharyngeal carcinoma risk factors
   Introduction
   Graph construction with inclusion of domain knowledge
   Graph-based analysis and related work
   Predictive performance
   Model calibration
   Detection of the missing mechanisms
   Discussion and conclusions

7 Robust gene selection from microarray data
   Introduction
   Robust feature subset selection
   Ensemble FSS by consensus ranking
   Experiments
      Robustness versus classification accuracy
      Ensemble FSS technique on Diabetes data
   Discussion and conclusions

8 Analysis of lifestyle and metabolic predictors of visceral obesity with Bayesian networks
   Introduction
   Simulation experiments with HPC
   Results on biological data
   Discussion and conclusions

9 Conclusions and Future Work
   Summary
   Future Work

List of Figures

2.1 Toy example of causal network presented in the WCCI2008 Causation and Prediction Challenge
2.2 Three Markov equivalent DAGs. There are no other DAGs Markov equivalent to them
2.3 The marginal distribution of V, S, L and F cannot satisfy the faithfulness condition with any DAG
3.1 Toy problem about PC learning: Z ∈ PC(T), so that I_G(X, T | Z)
3.2 Divide-and-conquer algorithms can be less data-efficient than incremental algorithms
3.3 HPC empirical evaluation in terms of scalability
3.4 HPC empirical evaluation in terms of Euclidean distance from perfect precision and recall
3.5 HPC empirical evaluation in terms of the number of false positives
3.6 HPC empirical evaluation in terms of the number of false negatives
4.1 GreedyGmax's p-value as a function of the ratio of missing data
4.2 Subgraph taken from the benchmark ALARM displaying the MB of the variable SHUNT
4.3 GreedyGmax's p-value as a function of the ratio of missing data when testing on variables of the benchmark ALARM
4.4 Original BN benchmark used by [Ramoni & Sebastiani (2001)]
4.5 MCAR made from the original BN benchmark used by [Ramoni & Sebastiani (2001)]
4.6 MAR made from the original BN benchmark used by [Ramoni & Sebastiani (2001)]
4.7 NMAR (IM) made from the original BN benchmark used by [Ramoni & Sebastiani (2001)]
4.8 Toy examples of missing completely at random (MCAR)
4.9 Toy examples of missing at random (MAR)
4.10 Toy examples of not missing at random (NMAR/IM)
4.11 Probability tables used to vary the missing data ratio of the DAG shown in Figure 4.10
4.12 Average accuracy in detecting the mechanism NMAR (IM) of the toy problem shown in Figure 4.10
5.1 Graphical representation of the MCAR, MAR and NMAR (IM) used for empirical experiments
5.2 Bayesian network used for generating data from the congressional voting records dataset
5.3 Empirical evaluation of GMB on a congressional voting records dataset
5.4 Empirical evaluation of GMB for MCAR, MAR and IM (NMAR)
6.1 Local BN graph skeleton around variable NPC
6.2 Local PDAG of Figure 6.1
6.3 The ROC curves obtained by 10-fold cross-validation with a Naive Bayes classifier
6.4 Model calibration. Top: Markov boundary. Bottom: all variables
6.5 NPC graph with dummy missingness variables shown in dotted line
7.1 Robustness vs MB size for the benchmarks Genes and Pigs
7.2 Comparative accuracy for the benchmarks Genes and Pigs
7.3 MBOR outputs for microarray data
8.1 Bootstrap-based validation for the algorithm HPC on datasets from the benchmark INSULIN
8.2 BN learned from 34 risk factors related to lifestyle, adiposity, body fat distribution, blood lipid profile and adipocyte sizes

Chapter 1
Introduction

1.1 An Overview

A Bayesian network (BN) is a graphical structure for representing the probabilistic relationships among a large number of features (or variables [1]) and for doing probabilistic inference with those features. The graphical nature of Bayesian networks gives a very intuitive grasp of the relationships among the features. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Bayesian networks are used for modeling knowledge in computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis), medicine, document classification, information retrieval, image processing, data fusion, decision support systems, engineering, gaming and law.

The term Bayesian networks was coined by Judea Pearl to emphasize three aspects [Pearl (1986)]:

1. The often subjective nature of the input information.

2. The reliance on Bayes's conditioning as the basis for updating information.

3. The distinction between causal and evidential modes of reasoning, which underscores Thomas Bayes' posthumously published paper of 1763 [Bayes (1763)].

[1] The two terms feature and variable are used without distinction in this thesis.

There are numerous representations available for data analysis, including rule bases, decision trees, and artificial neural networks; and there are many techniques for data analysis such as density estimation, classification, regression, and clustering. So what do Bayesian networks have to offer? There are at least three answers:

1. Bayesian networks can readily handle incomplete data sets. For example, consider a classification or regression problem where two of the explanatory or input variables are strongly anti-correlated. This correlation is not a problem for standard supervised learning techniques, provided all inputs are measured in every case. When one of the inputs is not observed, however, most models will produce an inaccurate prediction, because they do not encode the dependencies between the input variables. Bayesian networks offer a natural way to encode such dependencies.

2. Bayesian networks help in the process of learning about causal relationships. Learning about causal relationships is important for at least two reasons. The process is useful when we are trying to gain understanding about a problem domain, for example, during exploratory data analysis. In addition, knowledge of causal relationships allows us to make predictions in the presence of interventions. For example, a marketing analyst may want to know whether or not it is worthwhile to increase exposure of a particular advertisement in order to increase the sales of a product. To answer this question, the analyst would like to determine whether or not the advertisement is a cause of increased sales, and to what degree.

3. Bayesian networks in conjunction with Bayesian statistical techniques facilitate the combination of domain knowledge and data. Anyone who has performed a real-world analysis knows the importance of prior or domain knowledge, especially when data is scarce or expensive. The fact that some commercial systems (i.e., expert systems) can be built from prior knowledge alone is a testament to the power of prior knowledge. Bayesian networks have a causal semantics that makes the encoding of causal prior knowledge particularly straightforward. In addition, Bayesian networks encode

the strength of causal relationships with probabilities. Consequently, prior knowledge and data can be combined with well-studied techniques from Bayesian statistics.

Learning a Bayesian network from data requires identifying both the model structure G and the corresponding set of model parameter values. However, the study developed in this thesis focuses only on methods for identifying the Bayesian network structure from data. The problem of learning the most probable a posteriori BN from data is worst-case NP-hard [Chickering (2002); Chickering et al. (2004)], and the recent explosion of high-dimensional datasets poses a serious challenge to existing BN structure learning algorithms. Two types of BN structure learning methods have been proposed so far: constraint-based (CB) and score-and-search methods. While score-and-search methods are efficient for learning the whole BN structure, the ability to scale up to hundreds of thousands of variables is a key advantage of CB methods over score-and-search methods. All the proposals of this thesis focus on improving constraint-based methods.

Several CB algorithms have been proposed recently for local BN structure learning [Fu et al. (2008); Nilsson et al. (2007a); Peña (2008); Peña et al. (2007); Tsamardinos & Brown (2008); Tsamardinos et al. (2006)]. They search for conditional independence relationships among the variables of a dataset and construct a local structure around the target node without having to construct the whole BN first, hence their scalability. These algorithms are appropriate for situations where the sample size is large enough with respect to the network degree, that is, where the number of parents and children (PC set) of each node in the network is relatively small with respect to the number of instances in the dataset. However, they are plagued with a severe problem: the number of false negatives increases swiftly as the size of the PC set increases. This well-known problem is common to all CB methods and has led several authors to reduce, as much as possible, the size of the conditioning sets with a view to enhancing the data-efficiency of their methods [Fu et al. (2008); Peña et al. (2007); Tsamardinos et al. (2006)].

1.2 Author's Contributions

The main contributions to the field of constraint-based Bayesian network structure learning made by the author include:

1. A novel structure learning algorithm called Hybrid Parents and Children (HPC) [Aussem et al. (2009b)]. HPC was proven to be correct under the faithfulness condition. Extensive empirical experiments were provided on public synthetic and real-world datasets of various sample sizes to assess HPC's accuracy and scalability, and significant improvements were obtained. In addition, the number of calls to the independence test (and hence the effective complexity) is only O(n^1.9) in practice on the eight BN benchmarks that we considered and O(n^1.21) on a real drug design dataset characterized by almost 140,000 features.

2. An extension of HPC designed for the specific aim of feature selection for probabilistic classification. This extension is called MBOR and was already applied in [Aussem et al. (2009c); de Morais & Aussem (2008a,b)] with very promising results after extensive empirical experiments on synthetic and real-world datasets. MBOR searches the Markov boundary of a target as a solution to the problem of feature selection and was shown to scale up to hundreds of thousands of variables. Like HPC, MBOR was also proven to be correct under the faithfulness condition.

3. A novel conservative feature selection method for handling incomplete datasets [Aussem & de Morais (2008)]. The method is conservative in the sense that it selects the minimal subset of features that renders the rest of the features independent of the target (the class variable) without making any assumption about the mechanism of missing data. The idea is that, when no information about the pattern of missing data is available, an incomplete dataset contains the set of all possible estimates. This conservative test addresses the main shortcoming of CB methods with missing data: the difficulty of performing an independence test when some entries are missing without making any assumption about the missing data mechanism.

4. A new graphical approach for exploiting data missingness in Bayesian network modeling [de Morais & Aussem (2009a)]. The novel approach makes use of Bayesian networks for explicitly representing the information about the absence of data. This work focused on two different, but not independent, aims: first, to help detect the missing data mechanisms, and second, to improve accuracy in classification when working with missing data. The missingness information is taken into account in the structure of the Bayesian network that represents the joint probability distribution of all the variables, including new dummy variables that were artificially created for representing missingness.

1.3 Applications

The main applications of the methods presented in this thesis to real-world problems made by the author include:

1. Application of the algorithm HPC for extracting the relevant risk factors that are statistically associated with nasopharyngeal carcinoma (NPC) [Aussem et al. (2009a)]. Experiments for detecting the missing data mechanisms present in this dataset were also carried out. The dataset was obtained from a case-control epidemiologic study performed by the International Agency for Research on Cancer in the Maghreb (north Africa). It consists of 1289 subjects (664 cases of NPC and 625 controls) and 150 nominal variables. In this study, special emphasis is placed on integrating domain knowledge and statistical data analysis. Once the graph skeleton is constructed from data, it is directed by the domain expert according to his causal interpretation, and additional latent variables are added to the graph for the sake of clarity, coherence and conciseness. The graphical representation provides a statistical profile of the recruited population and meanwhile helps identify the important risk factors involved in NPC.

2. Application of the algorithm MBOR on a microarray dataset in order to provide a robust analysis of type 2 diabetes [Aussem et al. (2009c)]. The dataset used in this study consists of 22,283 genes and only 143 samples. It

was obtained in collaboration with the INSERM U870/INRA 1235 laboratory and represents a compilation of different microarray data published during the last five years on the skeletal muscle of patients suffering from type 2 diabetes or obesity, and of healthy subjects. Multiple runs of MBOR on re-samples of the microarray data are combined, using ensemble techniques, to yield more robust results. Genes were aggregated into a consensus gene ranking and the top-ranked features were analyzed by biologists. The findings presented in this study are in good agreement with the genes that were associated with an increased risk of diabetes in the recent medical literature.

3. The algorithm HPC was applied for representing the statistical dependencies between 34 clinical variables among 150 obese women with various degrees of obesity. Features affecting obesity are of high current interest. Clinical data, such as patient history, lifestyle parameters and basic or even more elaborate laboratory analyses (e.g., adiposity, body fat distribution, blood lipid profile and adipocyte sizes), form a complex set of inter-related variables that may help better understand the pathophysiology of visceral obesity and provide guidance for its clinical management. In the work presented in this chapter a bootstrap method was used to generate more robust network structures. The statistical significance of edge strengths is evaluated using this approach: if an edge has a confidence above the threshold, it is included in the consensus network. This study made thorough use of physiological expertise integrated into the graph structure.

1.4 Outline

This thesis is divided into 9 chapters. A great effort was made in order to provide self-contained chapters. For this reason, some redundant information can be seen from one chapter to another. However, the brief background provided in chapter 2 is necessary for everyone who is not familiar with Bayesian networks. Chapter 2 provides the important background about the principal concepts of Bayesian networks and constraint-based learning. In chapter 3, the algorithms HPC and

MBOR are introduced. Furthermore, this chapter also presents the parallel approach for both algorithms HPC and MBOR and a thorough discussion about the main problems that plague CB Bayesian network structure learning, including the problem of almost-deterministic relationships among variables. Chapter 4 introduces a novel conservative feature selection method for handling incomplete datasets. A different approach is presented in chapter 5, which exploits data missingness in Bayesian network modeling. The last chapters of this thesis contain several applications to real-world problems. In chapter 6 the algorithm HPC is applied for extracting the relevant risk factors that are statistically associated with nasopharyngeal carcinoma. In chapter 7 the algorithm MBOR (chapter 3) is applied on a microarray dataset in order to provide a robust analysis of type 2 diabetes. A graphical representation helping identify the most important predictors of visceral obesity is presented in chapter 8, obtained by applying the algorithm HPC on a dataset containing 34 clinical variables among 150 obese women with various degrees of obesity. Finally, chapter 9 presents a summary and discusses future work.

Chapter 2
Background about Bayesian networks and structure learning

2.1 Introduction

Bayesian networks (BNs) are probabilistic graphical models that offer a coherent and intuitive representation of uncertain domain knowledge. Formally, BNs are directed acyclic graphs (DAGs) modeling probabilistic conditional independences among variables. The graphical part of a BN reflects the structure of a problem, while local interactions among neighboring variables are quantified by conditional probability distributions. One of the main advantages of BNs over other artificial intelligence (AI) schemes for reasoning under uncertainty is that they readily combine expert judgment with knowledge extracted from the data within the probabilistic framework. Another advantage is that they represent graphically the (possibly causal) independence relationships that may exist, in a very parsimonious manner [Brown & Tsamardinos (2008)].

Formally, a BN is a tuple <G, P>, where G = <U, E> is a directed acyclic graph with nodes representing the random variables U and P a joint probability distribution on U. In addition, G and P must satisfy the Markov condition: every variable X_i ∈ U is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa_i^G.

The analysis of the Bayesian network structure can give very important information for understanding a problem at hand. For instance, let us consider

Figure 2.1: Toy example of causal network presented in the WCCI2008 Causation and Prediction Challenge.

the causal network presented in figure 2.1. This network was presented as a toy example of a causal network in the WCCI2008 Causation and Prediction Challenge [Guyon et al. (2008)]. When data is generated from a causal network, such a causal network very often coincides with the structure of a Bayesian network that represents the joint probability distribution of the variables in the problem. Clearly, the causal network of figure 2.1 is acyclic, therefore it is called a causal DAG. However, such a DAG must satisfy the Markov condition in order to be a Bayesian network. The concept of causality is rather controversial, but when one considers that an effect is a future consequence of a past cause, the Markov condition is observed from a causal DAG. It means that when the empirical data is generated from a causal DAG G by a stochastic process, then G and P satisfy the Markov condition. In other words, if the value of each variable X_i is chosen at random with some probability P(X_i | Pa_i^G), based solely on the values of Pa_i^G, then the overall distribution P of the generated instances x_1, x_2, ..., x_n and the DAG G will satisfy the Markov condition [Pearl (2000)].
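To make the data-generation process concrete, the following minimal sketch performs exactly this ancestral sampling on a three-node fragment inspired by figure 2.1. The fragment and all probability values below are invented here for illustration; they are not taken from the challenge network or from the thesis.

import random

# Toy causal DAG: Smoking -> Lung Cancer -> Coughing (hypothetical fragment;
# the conditional probabilities are made up for illustration).
parents = {"Smoking": [], "LungCancer": ["Smoking"], "Coughing": ["LungCancer"]}
order = ["Smoking", "LungCancer", "Coughing"]   # a topological order of the DAG

# cpt[v][pa_values] = P(v = 1 | parents = pa_values); variables are binary (0/1).
cpt = {
    "Smoking":    {(): 0.30},
    "LungCancer": {(0,): 0.01, (1,): 0.10},
    "Coughing":   {(0,): 0.20, (1,): 0.90},
}

def sample_instance():
    """Ancestral sampling: draw each X_i from P(X_i | Pa_i^G) in topological order."""
    x = {}
    for v in order:
        p_one = cpt[v][tuple(x[u] for u in parents[v])]
        x[v] = 1 if random.random() < p_one else 0
    return x

data = [sample_instance() for _ in range(5000)]

A dataset generated this way is, by construction, a sample from a distribution that satisfies the Markov condition with the DAG, which is exactly the setting assumed by the structure learning algorithms discussed later.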

A lot of information can be taken from the Bayesian network that coincides with the causal DAG presented in figure 2.1. For instance, it is clear that Smoking is directly associated with Lung Cancer. One can also see that even if Yellow Fingers is associated with Lung Cancer, it is not a direct association: it passes through Smoking. It is clear that Born an Even Day has nothing to do with Lung Cancer. Interestingly, Car Accident could be even more predictive of Lung Cancer than Smoking, because there are three information paths between Lung Cancer and Car Accident. Nonetheless, for a physician it would be much more important to discover that Smoking has a direct impact on developing Lung Cancer than that Car Accident is predictive. One can see that Allergy is independent of Lung Cancer when there is no information about the values of the other variables; but when it is known that a patient is frequently Coughing, then the fact of knowing that the same patient has no Allergy can increase the probability of this patient having Lung Cancer... However, the structure of such a Bayesian network is not known beforehand when a dataset containing observational data is the only available piece of information. The Bayesian network structure search is the main aim of what is presented in this thesis.

This chapter recalls some concepts of Bayesian networks and structure learning that are important for the comprehension of what is discussed in the sequel of this thesis. More information about Bayesian networks can be found, for instance, in [Neapolitan (2004); Pearl (2000)]. The contents of the next two sections were mostly taken from [Neapolitan (2004)]. A thorough discussion on Bayesian networks can also be found in [François (2006); Naïm et al. (2004)].

2.2 Some principles of Bayesian networks

As stated in the previous section, a BN is a tuple <G, P>, where G = <U, E> is a directed acyclic graph (DAG) with nodes representing the random variables U, arcs E the connections between the random variables, and P a joint probability distribution on U. In addition, G and P must satisfy the Markov condition: every variable X_i ∈ U is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa_i^G. From the Markov condition, it is easy to prove [Neapolitan (2004)] that the joint probability distribution P on the variables in U can be factored as follows:

P(U) = P(X_1, ..., X_n) = \prod_{i=1}^{n} P(X_i | Pa_i^G)     (2.1)
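As a small numerical companion to equation 2.1 (again an illustrative sketch with invented numbers, reusing the toy fragment from section 2.1), the joint probability of a complete assignment is the product of one local term P(x_i | Pa_i^G) per variable:

from itertools import product

parents = {"Smoking": [], "LungCancer": ["Smoking"], "Coughing": ["LungCancer"]}
order = ["Smoking", "LungCancer", "Coughing"]
cpt = {
    "Smoking":    {(): 0.30},
    "LungCancer": {(0,): 0.01, (1,): 0.10},
    "Coughing":   {(0,): 0.20, (1,): 0.90},
}

def joint(assignment):
    """Equation 2.1: P(U) = prod_i P(X_i | Pa_i^G), for one complete assignment."""
    p = 1.0
    for v in order:
        p_one = cpt[v][tuple(assignment[u] for u in parents[v])]
        p *= p_one if assignment[v] == 1 else 1.0 - p_one
    return p

# Sanity check: the factored probabilities of all 2^3 assignments sum to 1.
total = sum(joint(dict(zip(order, values))) for values in product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12

Note how the factorization requires storing only 5 probability values here, instead of the 7 independent values a full joint table over three binary variables would need; the saving grows exponentially with the number of variables.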

Equation 2.1 allows a parsimonious decomposition of the joint distribution P. It enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few. Such a decomposition is possible because a BN structure G entails a set of conditional independence assumptions. They can all be identified by the d-separation criterion [Pearl (2000)]. We discuss this important concept next, but first we need to review some graph theory.

Suppose we have a DAG G = <U, E>. We call a chain between two nodes (X, Y) ∈ U a set of connections that create a path in G between the two nodes X and Y. For example, [Yellow Fingers, Smoking, Lung Cancer, Coughing, Allergy] and [Allergy, Coughing, Lung Cancer, Smoking, Yellow Fingers] represent the same chain between Yellow Fingers and Allergy in the DAG of figure 2.1. We often denote chains by showing undirected lines between the nodes in the chain; if we want to show the direction of the edges, we use arrows. A chain containing two nodes is called a link. Given the directed edge X → Y, we say the tail of the edge is X and the head of the edge is Y. We also say the following:

- A chain X → W → Y is a head-to-tail meeting, the edges meet head-to-tail at W, and W is a head-to-tail node on the chain.
- A chain X ← W → Y is a tail-to-tail meeting, the edges meet tail-to-tail at W, and W is a tail-to-tail node on the chain.
- A chain X → W ← Y is a head-to-head meeting, the edges meet head-to-head at W, and W is a head-to-head node on the chain.
- A chain X − W − Y, such that X and Y are not adjacent, is an uncoupled meeting.

2.2.1 Markov condition and d-separation

Consider three disjoint sets of variables, X, Y and Z, which are represented as nodes in a directed acyclic graph G. To test whether X is independent of Y given Z in any distribution compatible with G, we need to test whether the nodes corresponding to variables Z block (d-separate) all chains between nodes in X

and nodes in Y. Blocking is to be interpreted as stopping the flow of information (or of dependence) between the variables that are connected by such chains. Next we develop the concept of d-separation, and show the following: (1) the Markov condition entails that all d-separations are conditional independences, and (2) every conditional independence entailed by the Markov condition is identified by d-separation. That is, if <G, P> satisfies the Markov condition, every d-separation in G is a conditional independence in P. Furthermore, every conditional independence which is common to all probability distributions satisfying the Markov condition with G is identified by d-separation.

Definition 1 Let G = <U, E> be a DAG, A ⊆ U, X and Y be distinct nodes in (U \ A), and ρ be a chain between X and Y. Then ρ is blocked by A if one of the following holds:

- There is a node Z ∈ A on the chain ρ, and the edges incident to Z on ρ meet head-to-tail at Z.
- There is a node Z ∈ A on the chain ρ, and the edges incident to Z on ρ meet tail-to-tail at Z.
- There is a node Z on the chain ρ such that Z and all of Z's descendants are not in A, and the edges incident to Z on ρ meet head-to-head at Z.

We say the chain is blocked at any node in A where one of the above meetings takes place. There may be more than one such node. The chain is called active given A if it is not blocked by A.

Definition 2 Let G = <U, E> be a DAG, A ⊆ U, and X and Y be distinct nodes in (U \ A). We say X and Y are d-separated by A in G if every chain between X and Y is blocked by A. It is not hard to see that every chain between X and Y is blocked by A if and only if every simple chain between X and Y is blocked by A.

Definition 3 Let G = <U, E> be a DAG, and A, B and C be mutually disjoint subsets of U. We say A and B are d-separated by C in G if for every X ∈ A and Y ∈ B, X and Y are d-separated by C. We write I_G(A, B | C). If C = ∅, we write only I_G(A, B).
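Definitions 1-3 translate directly into a decision procedure. The sketch below is not taken from the thesis; it implements the classical moral-graph criterion, which is equivalent to chain blocking: restrict the DAG to the ancestral set of A ∪ B ∪ C, moralize it (connect each pair of co-parents and drop edge directions), remove C, and declare I_G(A, B | C) exactly when A and B end up disconnected.

from itertools import combinations

def d_separated(parents, A, B, C):
    """True iff I_G(A, B | C) in the DAG given by `parents` (node -> list of parents)."""
    # 1. Keep only the ancestral set of A, B and C.
    anc, stack = set(A | B | C), list(A | B | C)
    while stack:
        for p in parents[stack.pop()]:
            if p not in anc:
                anc.add(p)
                stack.append(p)
    # 2. Moralize: connect each node to its parents, and its parents to each other.
    adj = {v: set() for v in anc}
    for v in anc:
        for p in parents[v]:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(parents[v], 2):
            adj[p].add(q); adj[q].add(p)
    # 3. Remove C and test whether A can reach B in the undirected remainder.
    seen, stack = set(), [a for a in A if a not in C]
    while stack:
        v = stack.pop()
        if v in B:
            return False
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for w in adj[v] if w not in C and w not in seen)
    return True

# The uncoupled head-to-head meeting X -> Z <- Y: X and Y are d-separated by the
# empty set but NOT by {Z} (conditioning on a head-to-head node activates the chain).
g = {"X": [], "Y": [], "Z": ["X", "Y"]}
assert d_separated(g, {"X"}, {"Y"}, set())
assert not d_separated(g, {"X"}, {"Y"}, {"Z"})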

We write I_G(A, B | C) because d-separation identifies all and only those conditional independences entailed by the Markov condition for G. The three following lemmas are needed to prove this.

Lemma 1 Let P be a probability distribution of the variables in U and G = <U, E> be a DAG. Then <G, P> satisfies the Markov condition if and only if, for every three mutually disjoint subsets A, B, C ⊆ U, whenever A and B are d-separated by C, A and B are conditionally independent in P given C. That is, <G, P> satisfies the Markov condition if and only if I_G(A, B | C) ⇒ I_P(A, B | C).

Proof. The proof can be found in [Neapolitan (2004)].

According to Lemma 1, if A and B are d-separated by C in G, the Markov condition entails I_P(A, B | C). For this reason, if <G, P> satisfies the Markov condition, we say G is an independence map (I-map) of P. The question that arises now is whether the converse of what is stated by Lemma 1 is also true. The next two lemmas address this. First we have a definition.

Definition 4 Let U be a set of random variables, and A_1, B_1, C_1, A_2, B_2, and C_2 be subsets of U. We say conditional independence I_P(A_1, B_1 | C_1) is equivalent to conditional independence I_P(A_2, B_2 | C_2) if for every probability distribution P of U, I_P(A_1, B_1 | C_1) holds if and only if I_P(A_2, B_2 | C_2) holds.

Lemma 2 Any conditional independence entailed by a DAG, based on the Markov condition, is equivalent to a conditional independence among disjoint sets of random variables.

Proof. The proof can be found in [Neapolitan (2004)].

Due to the preceding lemma, we need only discuss disjoint sets of random variables when investigating conditional independences entailed by the Markov condition. The next lemma states that the only such conditional independences are those that correspond to d-separations:

Lemma 3 Let G = <U, E> be a DAG, and 𝒫 be the set of all probability distributions P such that <G, P> satisfies the Markov condition. Then for every three mutually disjoint subsets A, B, C ⊆ U,

I_P(A, B | C) for all P ∈ 𝒫 ⟺ I_G(A, B | C).

Proof. The proof can be found in [Geiger & Pearl (1990)].

Definition 5 We say conditional independence I_P(A, B | C) is identified by d-separation in G if one of the following holds:

- A, B and C are mutually disjoint and I_G(A, B | C).
- A, B and C are not mutually disjoint, I_P(A, B | C) and I_P(A′, B′ | C′) are equivalent for some mutually disjoint A′, B′ and C′, and we have I_G(A′, B′ | C′).

Theorem 1 Based on the Markov condition, a DAG G entails all and only those conditional independences that are identified by d-separation in G.

Proof. The proof follows immediately from Lemmas 1, 2 and 3.

One must be careful to interpret Theorem 1 correctly. A particular distribution P that satisfies the Markov condition with G may have conditional independences that are not identified by d-separation. The next definition is about the situation when the converse of Theorem 1 is also true.

Definition 6 Suppose we have a joint probability distribution P of the random variables in some set U and a DAG G = <U, E>. We say that <G, P> satisfies the faithfulness condition if, based on the Markov condition, G entails all and only the conditional independences in P. That is, the following two conditions hold:

- <G, P> satisfies the Markov condition. This means G entails only conditional independences in P.
- All conditional independences in P are entailed by G, based on the Markov condition.

When <G, P> satisfies the faithfulness condition, we say P and G are faithful to each other, and we say that G is a perfect map (P-map) of P. When <G, P> does not satisfy the faithfulness condition, we say they are unfaithful to each other.
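Lemma 1 can be verified numerically on a toy network. The following sketch (invented probability values, not from the thesis) enumerates the joint distribution of the chain X → Z → Y via equation 2.1 and confirms that the d-separation I_G(X, Y | Z) is accompanied by the conditional independence I_P(X, Y | Z), while X and Y remain marginally dependent:

from itertools import product

parents = {"X": [], "Z": ["X"], "Y": ["Z"]}
order = ["X", "Z", "Y"]
cpt = {"X": {(): 0.3}, "Z": {(0,): 0.2, (1,): 0.9}, "Y": {(0,): 0.1, (1,): 0.7}}

def joint(a):
    """Factored joint probability of one complete binary assignment (equation 2.1)."""
    p = 1.0
    for v in order:
        p1 = cpt[v][tuple(a[u] for u in parents[v])]
        p *= p1 if a[v] == 1 else 1.0 - p1
    return p

def marginal(fixed):
    """P(fixed), summing the factored joint over all unconstrained variables."""
    free = [v for v in order if v not in fixed]
    return sum(joint({**fixed, **dict(zip(free, vals))})
               for vals in product([0, 1], repeat=len(free)))

def indep(A, B, C, tol=1e-9):
    """I_P(A, B | C) checked as P(a, b, c) * P(c) == P(a, c) * P(b, c) for all values."""
    for av, bv, cv in product(*(list(product([0, 1], repeat=len(S))) for S in (A, B, C))):
        a, b, c = dict(zip(A, av)), dict(zip(B, bv)), dict(zip(C, cv))
        if abs(marginal({**a, **b, **c}) * marginal(c)
               - marginal({**a, **c}) * marginal({**b, **c})) > tol:
            return False
    return True

assert indep(["X"], ["Y"], ["Z"])     # matches the d-separation I_G(X, Y | Z)
assert not indep(["X"], ["Y"], [])    # X and Y are marginally dependent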

Figure 2.2: Three Markov equivalent DAGs. There are no other DAGs Markov equivalent to them.

2.2.2 Markov equivalence

Many DAGs are equivalent in the sense that they have the same d-separations. For example, each of the DAGs in figure 2.2 has the d-separations I_G(Y, Z | X) and I_G(X, W | {Y, Z}), and these are the only d-separations each has. After stating a formal definition of this equivalence, we give a theorem showing how it relates to probability distributions. Finally, we establish a criterion for recognizing this equivalence.

Definition 7 Let G_1 = <U, E_1> and G_2 = <U, E_2> be two DAGs containing the same set of nodes U. Then G_1 and G_2 are called Markov equivalent if for every three mutually disjoint subsets A, B, C ⊆ U, A and B are d-separated by C in G_1 if and only if A and B are d-separated by C in G_2. That is, I_{G_1}(A, B | C) ⟺ I_{G_2}(A, B | C).

Although the previous definition relates only to graph properties, its application is in probability, due to the following theorem:

Theorem 2 Two DAGs are Markov equivalent if and only if, based on the Markov condition, they entail the same conditional independences.

Proof. The proof follows immediately from Theorem 1.

Corollary 1 Let G_1 = <U, E_1> and G_2 = <U, E_2> be two DAGs containing the same random variables U. Then G_1 and G_2 are Markov equivalent if and

only if for every probability distribution P of U, <G_1, P> satisfies the Markov condition if and only if <G_2, P> satisfies the Markov condition.

Proof. The proof follows immediately from Theorems 1 and 2.

Next we present a theorem that shows how to identify Markov equivalence. Its proof requires the following three lemmas:

Lemma 4 Let G = <U, E> be a DAG, and X, Y ∈ U. Then X and Y are adjacent in G if and only if they are not d-separated by any set V ⊆ (U \ {X, Y}).

Proof. The proof can be found in [Neapolitan (2004)].

Lemma 5 Suppose we have a DAG G = <U, E> and an uncoupled meeting X − Z − Y. Then the following are equivalent:

- X → Z ← Y is a head-to-head meeting.
- There exists a set not containing Z that d-separates X and Y.
- All sets containing Z fail to d-separate X and Y.

Proof. The proof can be found in [Neapolitan (2004)].

Lemma 6 If G_1 and G_2 are Markov equivalent, then X and Y are adjacent in G_1 if and only if they are adjacent in G_2. That is, Markov equivalent DAGs have the same links (edges without regard for direction).

Proof. The proof can be found in [Neapolitan (2004)].

We now give the theorem that identifies Markov equivalence. This theorem was first stated in [Pearl et al. (1989)].

Theorem 3 Two DAGs G_1 and G_2 are Markov equivalent if and only if they have the same links (edges without regard for direction) and the same set of uncoupled head-to-head meetings.

Proof. The proof can be found in [Neapolitan (2004)].
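Theorem 3 yields an immediate computational check, sketched below (a standard construction, not the thesis's code): two DAGs are Markov equivalent exactly when they share the same skeleton and the same set of uncoupled head-to-head meetings.

from itertools import combinations

def skeleton(parents):
    """The links of the DAG: edges without regard for direction."""
    return {frozenset((p, c)) for c in parents for p in parents[c]}

def v_structures(parents):
    """Uncoupled head-to-head meetings X -> Z <- Y with X and Y non-adjacent."""
    links = skeleton(parents)
    return {(frozenset((x, y)), z)
            for z in parents
            for x, y in combinations(sorted(parents[z]), 2)
            if frozenset((x, y)) not in links}

def markov_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# X -> Z -> Y and X <- Z -> Y entail the same d-separations; X -> Z <- Y does not.
chain    = {"X": [], "Z": ["X"], "Y": ["Z"]}
fork     = {"X": ["Z"], "Z": [], "Y": ["Z"]}
collider = {"X": [], "Y": [], "Z": ["X", "Y"]}
assert markov_equivalent(chain, fork)
assert not markov_equivalent(chain, collider)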

Theorem 3 gives us a simple way to represent a Markov equivalence class with a single graph. That is, we can represent a Markov equivalence class with a graph that has the same links and the same uncoupled head-to-head meetings as the DAGs in the class. Any assignment of directions to the undirected edges in this graph that does not create a new uncoupled head-to-head meeting or a directed cycle yields a member of the equivalence class. Often there are edges other than uncoupled head-to-head meetings which must be oriented the same way in all Markov equivalent DAGs. For example, if an uncoupled meeting X → Y − Z is not head-to-head, then all the DAGs in the equivalence class must have Y − Z oriented as Y → Z. So we define a DAG pattern for a Markov equivalence class to be the graph that has the same links as the DAGs in the equivalence class and has oriented all and only the edges common to all of the DAGs in the equivalence class. The directed links in a DAG pattern are called compelled edges.

2.2.3 Embedded Faithfulness

The distribution P(v, s, l, f) in figure 2.3 (b) does not admit a faithful DAG representation. However, it is the marginal of a distribution, namely P(v, s, c, l, f), which does. This is an example of embedded faithfulness, which is defined as follows:

Definition 8 Let P be a joint probability distribution of the variables in V with V ⊆ U, and let G = <U, E> be a DAG. We say <G, P> satisfies the embedded faithfulness condition if the following two conditions hold:

- Based on the Markov condition, G entails only conditional independences in P for subsets including only elements of V.
- All conditional independences in P are entailed by G, based on the Markov condition.

When <G, P> satisfies the embedded faithfulness condition, we say P is embedded faithfully in G. Notice that faithfulness is a special case of embedded faithfulness in which U = V.

Theorem 4 Let P be a joint probability distribution of the variables in U with V ⊆ U, and G = <U, E>. If <G, P> satisfies the faithfulness condition, and P′ is the marginal distribution of V, then <G, P′> satisfies the embedded faithfulness condition.

Proof. The proof follows directly from Definition 8.

Figure 2.3: The marginal distribution of V, S, L and F cannot satisfy the faithfulness condition with any DAG.

Theorem 5 Let P be a joint probability distribution of the variables in V with V ⊆ U, and G = <U, E>. Then <G, P> satisfies the embedded faithfulness condition if and only if all and only the conditional independences in P are identified by d-separation in G restricted to elements of V.

Proof. The proof can be found in [Neapolitan (2004)].

2.2.4 Markov blankets and boundaries

A Bayesian network can have a large number of nodes, and the probability of a given node can be affected by instantiating a distant node. However, it turns out that the instantiation of a set of close nodes can shield a node from the effect of all other nodes. The following definition and theorem show this:

Definition 9 Let U be a set of random variables, P be their joint probability distribution, and X ∈ U. Then a Markov blanket M_X of X is any set of variables such that X is conditionally independent of all the other variables given M_X. That is, I_P(X, U \ (M_X ∪ {X}) | M_X).

Theorem 6 Suppose <G, P> satisfies the Markov condition. Then, for each variable X, the set of all parents of X, children of X and spouses of X (i.e., the other parents of X's children) is a Markov blanket of X.

Proof. The proof can be found in [Neapolitan (2004)].
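Theorem 6 makes the Markov blanket directly computable when the DAG is known. A minimal sketch, assuming the same parents-dictionary representation used in the earlier sketches:

def markov_blanket(parents, x):
    """Parents of x, children of x, and spouses of x (other parents of x's children)."""
    children = {v for v in parents if x in parents[v]}
    spouses = {p for c in children for p in parents[c] if p != x}
    return set(parents[x]) | children | spouses

# In X -> Z <- Y, Z -> W: the blanket of X is {Z} (child) plus {Y} (spouse via Z).
g = {"X": [], "Y": [], "Z": ["X", "Y"], "W": ["Z"]}
assert markov_blanket(g, "X") == {"Z", "Y"}

Under the faithfulness condition, Theorem 7 below states that this set is the unique Markov boundary of X — precisely the feature subset targeted by the algorithms of Part I.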

Definition 10 Let U be a set of random variables, P be their joint probability distribution, and X ∈ U. Then a Markov boundary MB_X of X is any Markov blanket of X such that none of its proper subsets is a Markov blanket of X.

Theorem 7 Suppose <G, P> satisfies the faithfulness condition. Then, for each variable X, the set of all parents of X, children of X and spouses of X is the unique Markov boundary of X.

Proof. The proof can be found in [Neapolitan (2004)].

Theorem 7 holds for all probability distributions, including ones that are not strictly positive. When a probability distribution is not strictly positive, there is not necessarily a unique Markov boundary. The final theorem presented in this section holds for strictly positive distributions.

Theorem 8 Suppose P is a strictly positive probability distribution of the variables in U. Then for each X ∈ U there is a unique Markov boundary of X.

Proof. The proof can be found in [Pearl (1988)].

2.3 Constraint-based structure learning

The problem of learning the most probable a posteriori BN from data is worst-case NP-hard [Chickering (2002); Chickering et al. (2004)], and the recent explosion of high-dimensional datasets poses a serious challenge to existing BN structure learning algorithms. Two types of BN structure learning methods have been proposed so far: score-and-search methods (a basic example is shown in algorithm 1) and constraint-based methods (a basic example is shown in algorithm 2). While score-and-search methods are efficient for learning the full BN structure, the ability to scale up to hundreds of thousands of variables is a key advantage of constraint-based methods over score-and-search methods. The study presented in this thesis is focused on constraint-based approaches.

Algorithm 1 DAG pattern search: a basic score-and-search algorithm
Require: D: dataset; U: set of random variables.
Ensure: A DAG pattern gp that approximately maximizes score(D, gp).
1: E ← ∅
2: gp ← (U, E)
3: repeat
4:   if [any DAG pattern in the neighborhood of the current DAG pattern increases score(D, gp)] then
5:     modify E according to the one that increases score(D, gp) the most
6:   end if
7: until [score(D, gp) is not increased anymore]

2.3.1 Soundness of constraint-based algorithms

Constraint-based (CB for short) learning methods systematically check the data for conditional independence relationships.

Typically, the algorithms run a χ² independence test when the dataset is discrete and a Fisher's Z test when it is continuous, in order to decide on dependence or independence, that is, upon the rejection or acceptance of the null hypothesis of conditional independence. A structure learning algorithm from data is said to be correct (or sound) if it returns the correct DAG pattern (or a DAG in the correct equivalence class) under the assumptions that the independence tests are reliable and that the learning dataset is a sample from a distribution P faithful to a DAG G. The (ideal) assumption that the independence tests are reliable means that they decide (in)dependence if and only if the (in)dependence holds in P. Based on what was just stated, we next prove the soundness of algorithm 2.

Lemma 7 If the set of all conditional independences in U admits a faithful DAG representation, algorithm 2 creates a link between X and Y if and only if there is a link between X and Y in the DAG pattern gp containing the d-separations in this set.

Proof. Algorithm 2 produces a link if and only if X and Y are not d-separated by any subset of U, which, owing to Lemma 4, is the case if and only if X and Y are adjacent in gp.

Lemma 8 If the set of all conditional independences in U admits a faithful DAG representation, then any directed edge created by algorithm 2 is a directed edge in the DAG pattern containing the d-separations in this set.

Algorithm 2 DAG pattern search: a basic constraint-based algorithm
Require: D: dataset; U: set of random variables.
Ensure: DAG pattern gp such that I_G(X, Y | Z) ⟺ I_P(X, Y | Z).
Step 1:
1: for all [pairs of nodes X, Y ∈ U] do
2:   search for a subset S_XY ⊆ U such that I(X, Y | S_XY);
3:   if [no such set can be found] then
4:     create the link X − Y in gp;
5:   end if
6: end for
Step 2:
7: for all [uncoupled meetings X − Z − Y] do
8:   if [Z ∉ S_XY] then
9:     orient X − Z − Y as X → Z ← Y;
10:  end if
11: end for
Step 3:
12: for all [uncoupled meetings X → Z − Y] do
13:   orient Z − Y as Z → Y;
14: end for
Step 4:
15: for all [links X − Y such that there is a directed path from X to Y] do
16:   orient X − Y as X → Y;
17: end for
Step 5:
18: for all [uncoupled meetings X − Z − Y such that X → W, Y → W and Z − W] do
19:   orient Z − W as Z → W;
20: end for
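Before returning to the proof of Lemma 8, here is a runnable sketch of Steps 1 and 2 of algorithm 2 (a simplified illustration, not the thesis's implementation). It assumes a reliable conditional independence oracle indep(x, y, S) — in practice a statistical test such as the G test of section 2.3.2 — builds the skeleton together with the separating sets S_XY, and orients the uncoupled head-to-head meetings; the orientation rules of Steps 3-5 are omitted for brevity.

from itertools import combinations

def basic_cb_search(variables, indep):
    """Steps 1-2 of algorithm 2: skeleton + head-to-head orientation.
    `indep(x, y, S)` must behave like a reliable test of I(X, Y | S)."""
    links, sepset = set(), {}
    # Step 1: link X - Y iff no subset of the remaining variables separates them.
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        found = next((set(s) for k in range(len(others) + 1)
                      for s in combinations(others, k) if indep(x, y, set(s))), None)
        if found is None:
            links.add(frozenset((x, y)))
        else:
            sepset[frozenset((x, y))] = found
    # Step 2: orient X - Z - Y as X -> Z <- Y when Z lies outside S_XY.
    arrows = set()   # (u, v) stands for the directed edge u -> v
    for z in variables:
        nbrs = [v for v in variables if frozenset((v, z)) in links]
        for x, y in combinations(nbrs, 2):
            if frozenset((x, y)) not in links and z not in sepset[frozenset((x, y))]:
                arrows.add((x, z)); arrows.add((y, z))
    return links, arrows

# Using the d_separated sketch from section 2.2.1 as a perfect oracle on X -> Z <- Y:
g = {"X": [], "Y": [], "Z": ["X", "Y"]}
oracle = lambda x, y, S: d_separated(g, {x}, {y}, set(S))
links, arrows = basic_cb_search(["X", "Y", "Z"], oracle)
assert arrows == {("X", "Z"), ("Y", "Z")}

Note that the exhaustive search over conditioning sets in Step 1 is exponential; practical CB algorithms such as HPC restrict the candidate conditioning sets precisely to gain scalability and data-efficiency.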

Proof. We consider the directed edges created in each step in turn:

Step 2: The fact that such edges must be directed follows from Lemma 5.

Step 3: If the uncoupled meeting X → Z − Y were X → Z ← Y, Z would not be in any set that d-separates X and Y, due to Lemma 5, which means we would have created the orientation X → Z ← Y in Step 2. Therefore, X → Z − Y must be X → Z → Y.

Step 4: If X − Y were X ← Y, we would have a directed cycle. Therefore, it must be X → Y.

Step 5: If Z − W were Z ← W, then X − Z − Y would have to be X → Z ← Y, because otherwise we would have a directed cycle. But if this were the case, we would have created the orientation X → Z ← Y in Step 2. So Z − W must be Z → W.

Lemma 9 If the set of all conditional independences in U admits a faithful DAG representation, all the directed edges in the DAG pattern containing the d-separations in this set are directed by algorithm 2.

Proof. The proof can be found in [Meek (1995)].

Theorem 9 If the set of all conditional independences in U admits a faithful DAG representation, algorithm 2 creates the DAG pattern containing the d-separations in this set.

Proof. The proof follows from Lemmas 7, 8 and 9.

2.3.2 G likelihood-ratio conditional independence test

Statistical tests are needed in order to verify the conditional independence I(X, Y | Z) from data. One of the most used statistical tests of conditional independence between two categorical variables is the G likelihood-ratio conditional independence test. In this thesis it is used to determine I_P(X, Y | Z) from data [Spirtes et al. (2000)]. The general formula for G is presented in equation 2.2.
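For discrete data, the G statistic has the standard likelihood-ratio form G = 2 Σ_{x,y,z} N_xyz ln(N_xyz N_z / (N_xz N_yz)), where N denotes observed counts; under the null hypothesis of conditional independence it is asymptotically χ²-distributed [Spirtes et al. (2000)]. The sketch below computes it from a list of samples (an illustration of that standard form, not the thesis's implementation; the degrees of freedom use the nominal count, whereas practical implementations adjust for sparse or empty cells):

import math
from collections import Counter
from scipy.stats import chi2   # chi-square survival function for the p-value

def g_test(data, x, y, z):
    """Likelihood-ratio test of I(X, Y | Z) on discrete data.
    data: list of dicts (one per sample); z: list of conditioning variables.
    Returns (G, degrees of freedom, p-value)."""
    n_xyz, n_xz, n_yz, n_z = Counter(), Counter(), Counter(), Counter()
    for r in data:
        zv = tuple(r[v] for v in z)
        n_xyz[(r[x], r[y], zv)] += 1
        n_xz[(r[x], zv)] += 1
        n_yz[(r[y], zv)] += 1
        n_z[zv] += 1
    g = 2.0 * sum(n * math.log(n * n_z[zv] / (n_xz[(xv, zv)] * n_yz[(yv, zv)]))
                  for (xv, yv, zv), n in n_xyz.items())
    levels = lambda v: len({r[v] for r in data})
    # Nominal degrees of freedom: (|X| - 1)(|Y| - 1) * prod |Z_i|.
    dof = (levels(x) - 1) * (levels(y) - 1) * math.prod(levels(v) for v in z)
    return g, dof, chi2.sf(g, dof)

For example, g_test(data, "Smoking", "Coughing", ["LungCancer"]) on data generated by the sampling sketch of section 2.1 should yield a large p-value, since in that toy chain the pair is conditionally independent given Lung Cancer.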


More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics. Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

More information

Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data University of Pittsburgh Carnegie Mellon University Pittsburgh Supercomputing

Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data University of Pittsburgh Carnegie Mellon University Pittsburgh Supercomputing Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data University of Pittsburgh Carnegie Mellon University Pittsburgh Supercomputing Center Yale University PIs: Ivet Bahar, Jeremy Berg,

More information

3. The Junction Tree Algorithms

3. The Junction Tree Algorithms A Short Course on Graphical Models 3. The Junction Tree Algorithms Mark Paskin mark@paskin.org 1 Review: conditional independence Two random variables X and Y are independent (written X Y ) iff p X ( )

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

How To Find Influence Between Two Concepts In A Network

How To Find Influence Between Two Concepts In A Network 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Influence Discovery in Semantic Networks: An Initial Approach Marcello Trovati and Ovidiu Bagdasar School of Computing

More information

Data Mining On Diabetics

Data Mining On Diabetics Data Mining On Diabetics Janani Sankari.M 1,Saravana priya.m 2 Assistant Professor 1,2 Department of Information Technology 1,Computer Engineering 2 Jeppiaar Engineering College,Chennai 1, D.Y.Patil College

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

A Bayesian Network Model for Diagnosis of Liver Disorders Agnieszka Onisko, M.S., 1,2 Marek J. Druzdzel, Ph.D., 1 and Hanna Wasyluk, M.D.,Ph.D.

A Bayesian Network Model for Diagnosis of Liver Disorders Agnieszka Onisko, M.S., 1,2 Marek J. Druzdzel, Ph.D., 1 and Hanna Wasyluk, M.D.,Ph.D. Research Report CBMI-99-27, Center for Biomedical Informatics, University of Pittsburgh, September 1999 A Bayesian Network Model for Diagnosis of Liver Disorders Agnieszka Onisko, M.S., 1,2 Marek J. Druzdzel,

More information

Big Data Analytics for Healthcare

Big Data Analytics for Healthcare Big Data Analytics for Healthcare Jimeng Sun Chandan K. Reddy Healthcare Analytics Department IBM TJ Watson Research Center Department of Computer Science Wayne State University 1 Healthcare Analytics

More information

life science data mining

life science data mining life science data mining - '.)'-. < } ti» (>.:>,u» c ~'editors Stephen Wong Harvard Medical School, USA Chung-Sheng Li /BM Thomas J Watson Research Center World Scientific NEW JERSEY LONDON SINGAPORE.

More information

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational Simply Mining Data Jilles Vreeken So, how do you pronounce Exploratory Data Analysis Jilles Vreeken Jilles Yill less Vreeken Fray can 17 August 2015 Okay, now we can talk. 17 August 2015 The goal So, what

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

In Proceedings of the Eleventh Conference on Biocybernetics and Biomedical Engineering, pages 842-846, Warsaw, Poland, December 2-4, 1999

In Proceedings of the Eleventh Conference on Biocybernetics and Biomedical Engineering, pages 842-846, Warsaw, Poland, December 2-4, 1999 In Proceedings of the Eleventh Conference on Biocybernetics and Biomedical Engineering, pages 842-846, Warsaw, Poland, December 2-4, 1999 A Bayesian Network Model for Diagnosis of Liver Disorders Agnieszka

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

Doctor of Philosophy in Computer Science

Doctor of Philosophy in Computer Science Doctor of Philosophy in Computer Science Background/Rationale The program aims to develop computer scientists who are armed with methods, tools and techniques from both theoretical and systems aspects

More information

Bayesian Networks. Mausam (Slides by UW-AI faculty)

Bayesian Networks. Mausam (Slides by UW-AI faculty) Bayesian Networks Mausam (Slides by UW-AI faculty) Bayes Nets In general, joint distribution P over set of variables (X 1 x... x X n ) requires exponential space for representation & inference BNs provide

More information

What is the purpose of this document? What is in the document? How do I send Feedback?

What is the purpose of this document? What is in the document? How do I send Feedback? This document is designed to help North Carolina educators teach the Common Core (Standard Course of Study). NCDPI staff are continually updating and improving these tools to better serve teachers. Statistics

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Unraveling versus Unraveling: A Memo on Competitive Equilibriums and Trade in Insurance Markets

Unraveling versus Unraveling: A Memo on Competitive Equilibriums and Trade in Insurance Markets Unraveling versus Unraveling: A Memo on Competitive Equilibriums and Trade in Insurance Markets Nathaniel Hendren January, 2014 Abstract Both Akerlof (1970) and Rothschild and Stiglitz (1976) show that

More information

Agenda. Interface Agents. Interface Agents

Agenda. Interface Agents. Interface Agents Agenda Marcelo G. Armentano Problem Overview Interface Agents Probabilistic approach Monitoring user actions Model of the application Model of user intentions Example Summary ISISTAN Research Institute

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D. Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

How To Understand The Theory Of Probability

How To Understand The Theory Of Probability Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Design and Analysis of the Causation and Prediction Challenge

Design and Analysis of the Causation and Prediction Challenge JMLR: Workshop and Conference Proceedings 3: 1-33 WCCI2008 workshop on causality Design and Analysis of the Causation and Prediction Challenge Isabelle Guyon Clopinet, California Constantin Aliferis New

More information

Prediction of DDoS Attack Scheme

Prediction of DDoS Attack Scheme Chapter 5 Prediction of DDoS Attack Scheme Distributed denial of service attack can be launched by malicious nodes participating in the attack, exploit the lack of entry point in a wireless network, and

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu. Procedia Computer Science 00 (2012) 1 21 Procedia Computer Science Top-k best probability queries and semantics ranking properties on probabilistic databases Trieu Minh Nhut Le, Jinli Cao, and Zhen He

More information

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

About the Author. The Role of Artificial Intelligence in Software Engineering. Brief History of AI. Introduction 2/27/2013

About the Author. The Role of Artificial Intelligence in Software Engineering. Brief History of AI. Introduction 2/27/2013 About the Author The Role of Artificial Intelligence in Software Engineering By: Mark Harman Presented by: Jacob Lear Mark Harman is a Professor of Software Engineering at University College London Director

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

D A T A M I N I N G C L A S S I F I C A T I O N

D A T A M I N I N G C L A S S I F I C A T I O N D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.

More information

Chapter 28. Bayesian Networks

Chapter 28. Bayesian Networks Chapter 28. Bayesian Networks The Quest for Artificial Intelligence, Nilsson, N. J., 2009. Lecture Notes on Artificial Intelligence, Spring 2012 Summarized by Kim, Byoung-Hee and Lim, Byoung-Kwon Biointelligence

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Labeling outerplanar graphs with maximum degree three

Labeling outerplanar graphs with maximum degree three Labeling outerplanar graphs with maximum degree three Xiangwen Li 1 and Sanming Zhou 2 1 Department of Mathematics Huazhong Normal University, Wuhan 430079, China 2 Department of Mathematics and Statistics

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

itesla Project Innovative Tools for Electrical System Security within Large Areas

itesla Project Innovative Tools for Electrical System Security within Large Areas itesla Project Innovative Tools for Electrical System Security within Large Areas Samir ISSAD RTE France samir.issad@rte-france.com PSCC 2014 Panel Session 22/08/2014 Advanced data-driven modeling techniques

More information

How To Create A Text Classification System For Spam Filtering

How To Create A Text Classification System For Spam Filtering Term Discrimination Based Robust Text Classification with Application to Email Spam Filtering PhD Thesis Khurum Nazir Junejo 2004-03-0018 Advisor: Dr. Asim Karim Department of Computer Science Syed Babar

More information

MACHINE LEARNING BASICS WITH R

MACHINE LEARNING BASICS WITH R MACHINE LEARNING [Hands-on Introduction of Supervised Machine Learning Methods] DURATION 2 DAY The field of machine learning is concerned with the question of how to construct computer programs that automatically

More information

Probability and statistics; Rehearsal for pattern recognition

Probability and statistics; Rehearsal for pattern recognition Probability and statistics; Rehearsal for pattern recognition Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception

More information