On learning of structures for Bayesian networks Timo Koski Mathematical statistics, KTH

Transcription

1 On learning of structures for Bayesian networks Timo Koski Mathematical statistics, KTH Sigtuna

2 Introduction Bayesian networks are a special case are multivariate (discrete) statistical distributions embodying a collection marginal and conditional independencies which may be represented by means of a directed acyclic graph.

3 A Bayesian Network W X Y Z W is a parent of X and Y, W and X are parents of Y, and X and Y are parents of Z. P (X, Y, Z, W) = P (Z X, Y) P (Y X, W) P (X W) P (W) We specify the simultaneous distribution by a set of simpler conditional distributions (modularity).

4 A Bayesian Network: learning structure W X Y Z W W X Y X Y Z Z

5 Graph Structure (DAG), random walk 1/5 1/6 omgivning omgivning

6 Model Choice The main challenges are: the increase in the number of graphical model structures as a function of the number of nodes.

7 Model Choice The main challenges are: the increase in the number of graphical model structures as a function of the number of nodes. the equivalence of the statistical models determined by different graphical models.

8 Robinson (1973) has given the following recursive function for computing the number f (d) of directed acyclic graphs with d nodes: f (d) = d ( ( 1) i+1 d i i=1 ) 2 i(d 1) f (d i). (1) This number grows exponentially, for d = 5 it is 29000, and for d = 10 it is approximately

9 Model choice There are two approaches of model selection for Bayesian networks: maximize a score function, e.g., the posterior probability of the graph P (G X), where G is a graph and X is a set of data (a sample of a set of random variables). The algorithms search the space of all possible structures (in fact, one should search over representors of equivalence classes, as discussed in the sequel).

10 Model choice There are two approaches of model selection for Bayesian networks: maximize a score function, e.g., the posterior probability of the graph P (G X), where G is a graph and X is a set of data (a sample of a set of random variables). The algorithms search the space of all possible structures (in fact, one should search over representors of equivalence classes, as discussed in the sequel). constraint based: algorithms estimate from data whether certain conditional independencies between the variables hold.

11 Algebraic statistics Bayesian networks are families of probability distributions, which are real algebraic varieties (or actually semi-algebraic sets): they are zeros of some polynomials in the probability simplex. L.D. Garcia, M. Stillman, and B. Sturmfels: Algebraic geometry of Bayesian networks. Journal of Symbolic Computation, 39, 2005, pp

12 Algebraic statistics & Model Choice L.D. Garcia: Algebraic statistics in Model Selection Uncertainty in Artificial Intelligence, UAI the correct dimension of a model when applying Bayesian Information Criterion (BIC) for model selection with hidden variables. BIC = an asymptotic expansion of P (G X).

13 Notations Consider the random variables X 1,..., X d. Each variable X i assumes values in the discrete and finite alphabet X i. The set of all possible configurations of X 1,..., X d is denoted by X = d i=1 X i. X i = n i. Let the generic element of X be denoted by x. R X denotes the real vector space of d dimensional tables of the format n 1 n 2... n d.

14 Joint probability, Polynomials We assign a joint probability to x X, P (x) = P (X 1 = x 1,...,X d = x d ) = p x1...x d We can regard p x1...x d as indeterminates (real numbers) that generate the ring R [X] of polynomial functions on the space of tables R X. Actually we need not restrict them to be probabilities.

15 Conditional Independence Let X,Y and Z be pairwise disjoint subsets of the random variables X 1,...,X d. We say that X and Y are conditionally independent given Z if P (X = x,y = y Z = z) = P (X = x Z = z) P (Y = y Z = z) for all x,y,z. We write this with the symbol X Y Z

16 Conditional Independence (1) B. Sturmfels 1 observed that X Y Z p xyz p x y z p x yz p xy z = 0 where p xyz e.t.c. are obtained by marginalization from the joint probability p x1...x d. Hence p xyz e.t.c. are polynomials of degreee 1 in R (X). 1 Solving Systems of Polynomial Equations, AMS, 2002

17 Conditional Independence (2) Therefore the cross product differences (cpd) cpd := p xyz p x y z p x yz p xy z correspond to homogeneous quadratic polynomials. Hence all the statements of conditional independence for a distribution P translate into a system of homogeneous quadratic polynomials (Sturmfels, loc.cit. 2002, Theorem 8.1.). We write I X Y Z for the ideal generated by these polynomials.

18 Independence Models Bayesian networks, as many other statistical models, can be described by a finite list of conditional independence statements, M = {X (1) Y (1) Z (1), X (2) Y (2) Z (2),...,X (m) Y (m) Z (m) } and the ideal of the model M is I M = I X (1) Y (1) Z (1) + I X (2) Y (2) Z (2) I X (m) Y (m) Z (m)

19 Independence Variety I M = I X (1) Y (1) Z (1) + I X (2) Y (2) Z (2) I X (m) Y (m) Z (m) The independence variety is V (I M ) is the set of common (complex) zeros of the polynomials in I M, or, equivalently, the set of all n 1 n 2... n d tables which satisfy the conditional independence statements in M.

20 Algebraic statistical model Let now V (I M + < p x1...x d 1 >) denote the subset of the independence variety V (I M ) which consists of non-negative tables whose entries sum to one. This is the subset of the probability simplex specified by the model M.

21 Notations A graph G is a pair (V, E), where E V V \{(i, i) V V } is the set of ordered pairs of nodes denoted as edges. If both (i, j) E and (j, i) E then there is an undirected edge between i and j written as i j. if (i, j) E but (j, i) E there is a directed edge between i and j written as i j. path is a sequence v 0,..., v n of different nodes such that v i v j and (v i 1, v i ) E for all i, j = 1,..., n. A cycle is a path with the only exception that v 0 = v n. A directed cycle is a cycle which has at least one directed edge.

22 Bayesian Network Let G = (V,E) be a directed acyclic graph (DAG) and the set of nodes V = {1,...,d} index the random variables X 1,...,X d 2. Let P be a joint proability distribution of X 1,...,X d. We say that the pair (G, P) is a (discrete)bayesian Network if (G, P) satisfies the following independence model a.k.a. the (directed) local Markov property local (G) = {X i nd(x i ) \ pa(x i ) pa(x i ),i = 1,2,...,d} where nd(x i ) is the set of nondescendants of X i and pa(x i ) is the set of parents of X i. 2 Convention: We are going to identify nodes of a graph with random variables.

23 Bayesian Network local (G) = {X i nd(x i ) \ pa(x i ) pa(x i ),i = 1,2,...,d} X j is a nondescendant of X i if there is no directed path from X i to X j.

24 Bayesian Network local (G) = {X i nd(x i ) \ pa(x i ) pa(x i ),i = 1,2,...,d} X j is a nondescendant of X i if there is no directed path from X i to X j. pa(x i ) is the set of parents of X i.

25 Global Markov and d-separation There may be more conditional independencies in (G,P) than those listed in local (G). The (directed) global Markov property of (G,P) is the list of independence statements where global (G) = {X Y Z X Y G Z }, X Y G Z says that X and Y are d -separated by Z. d-separation is a graphical algorithm to find conditional independence statements in a Bayesian network.

26 d-separation X and Y are d -separated by Z, if and only if all trails from X to Y are blocked by Z. A trail π from X i to X j is blocked in G by (the nodes in) Z if there is a node X b such that either X b Z and arrows of the trail π do not meet head-to-head at X b or X b / Z and X b has no descendants in Z, and arrows of π do meet head-to-head at X b.

27 d-separation

28 Faithfulness We should assume that (G,P) is such that X Y G Z X Y Z This is the starting point of the constraint based methods of model choice.

29 Factorization (1) Assume that Then we set pa(x j ) = {X i1,x i2,...,x ir } q j x 0 x 1 x r = P (X j = x 0 X i1 = x 1,X i2 = x 2,...,X ir = x r ) Let E denote the set of these unknowns q j x 0 x 1 x r, j = 1,2,...,d and let R[E] denote the polynomial ring they generate.

30 Factorization (2) φ : R E R X is defined by d j=1 q j x 1 x r φ px1...x d The image of φ is contained in the independence variety V global(g) The ring homomorphism Φ : R [X] R[E]/J, where J =< q j P x1 x r 1 > p x1...x d d j=1 q j x 1 x r

31 Factorization Theorem Theorem The following four subsets of the probability simplex coincide: ( V I local(g) + < ) p x1...x d 1 > = V ( I global(g) + < ) p x1...x d 1 > = = V (ker (Φ)) = image (φ). This is an improvement of a factorization theorem due to Steffen Lauritzen et.al. expressed in terms of algebraic statistics, and is due to L.D.Garcia, M. Stillman & B. Sturmfels: Algebraic geometry of Bayesian networks. Journal of Symbolic Computation, 39, 2005, pp

32 Markov equivalence There may be several graphs G such that (G,P) is a Bayesian network for a given P. In other words, G 1 and G 2 may have global(g 1 ) = global(g 2 ) for a given P. This cannot be ignored in the construction of algorithms. We shall discuss this within the context of scoring algorithms by introduction of partially directed acyclic graphs.

33 Markov equivalence A partially directed acyclic graph (PDAG) 3 is a graph G that does not contain any directed cycles. A directed acyclic graph (DAG) is a PDAG with only directed edges and an undirected graph (UG) is a PDAG with only undirected edges. M. Frydenberg: The Chain Graph Markov Property. Scand. J. Stat., 17, , S.A.Andersson, D. Madigan & M.D. Perlman: A Characterization of Markov Equivalence Classes for Acyclic Digraphs. The Annals of Statistics, 25, , a.k.a. CHAIN GRAPH

34 Conditional Independence Statements for PDAGs Lauritzen, Wermuth and Frydenberg (LWF) Markov properties. P,L,G relative to a (DAG, UG, PDAG) are defined as 1 the pairwise Markov property P, if for any pair (i,j) of nodes (i,j),(j,i) E with j nd(i), i j nd(i)\{i,j}. 2 the local Markov property L, for any i V, i nd(i)\bd(i) bd(i). 3 the global Markov property G, if for any triple of disjoint subsets (A,B,S V ) such that S separates A from B in (G An(A B S) ) m, A B S.

35 Markov equivalence A complex in a PDAG is a sequence of nodes v 1,...,v k,k 3 such that v 1 v 2,v i v i+1 for i = 2,...,k 2,v k 1 v k and there are no other edges (v i,v j ) E for all 0 i,j k. v 1 v k v v 3 v k 2 v k 1 A Complex

36 Markov equivalence The following is proved in (Frydenberg 1990), see also (Andersson et al 1997). Theorem Two PDAGs have the same (LWF) Markov properties iff they have the same undirected graph and the same complexes.

37 Markov equivalence b b b b a d a d a d a d c c c c D 1 D 2 D 3 D 4 { D D D } D 4

38 Markov equivalence A graph G = (V,E) is larger than the graph Ĝ = (V,Ê) if G might have edges where Ĝ has arrows. Theorem (Frydenberg 1990) For any PDAG G there exists a unique PDAG G = (Ṽ,Ẽ) with the same Markov properties as G such that G is larger than any other graph Ĝ = (ˆV,Ê) that has the same Markov properties as G.

39 Markov equivalence Each Markov-equivalence class is uniquely determined by a single PDAG (a chain graph), simultaneously equivalent to each graph in the class, hence the learning of structure should be based on these representatives. The question is, how to construct them.

40 Markov equivalence M. Volf and M. Studený: A graphical characterization of the largest chain graphs. International Journal of Approximate Reasoning 20, pp , Volf and Studený define a descending path, if either i i + 1 or i i + 1 for all i = 1,...,n 1. an ancestor of a node j as a node i such that there exists a descending path from i to j in G.

41 Markov equivalence A directed edge X i X j covers X y X z if X i is an ancestor of X y and X z is an ancestor of X j in a PDAG. An edge is protected iff it covers a directed edge which belongs to a complex. u x v y u v covers x y

42 Markov equivalence Theorem (Volf & Studený) A LPDAG is the largest PDAG of a Markov equivalent set of PDAGs iff every directed edge is protected. The largest graph is constructed by undirecting every non-protected arrow.

43 Markov equivalence b b b b a d a d a d a d c c c c D 1 D 2 D 3

44 Markov equivalence Note however that it is possible to represent a set of Markov properties in a PDAG that cannot be represented in a DAG or UG, such as Hence we need Gröbner bases for I M to check this. For small d the known number of LPDAGs that cannot be represented by UGs is 0 for d = 2,3 (Volf and Studený 1999). For d = 4,5 the percentage of LPDAGs that cannot be represented as UGs is 6 and 22.

45 Markov equivalence The presence of equivalence suggests the use of search space S, the states of which are equivalence classes represented by its LPDAG, that is, graphical models that all have the same Markov properties according to Definition LWF. Then the size of a search space, where the states are equivalence classes, is smaller than, e.g., the space of all DAGs for d 10 (Gillispie and Perlman 2001).

46 Milan Studený and his co-workers are currently developing (have already developed (?)) and algebraic representation, using objects called imsets, of the equivalence classes that so that an equivalence class corresponds uniquely to (a long) vector (with lots of zeros). Then the polytope generated by the imsets may be the basis of learning Bayesian networks by linear programming. J. Vomlel & M. Studený: Graphical and Algebraic Representatives of Conditional Independence Models. home.html

47 Marginal data distribution In the current approach the next issue is to compute the marginal data distribution P (X G) taking into account the equivalence classes. We follow/present the solution due to due to The symbol X designates a set of data, or, r samples of each of the variables X 1,...,X d, i.e., X = { x (l)} r l=1, x(l) X, with neither missing values nor hidden variables. For a subset A V, the corresponding random variable is X A = (X i ) i A.

48 Factorization Theorem Theorem (Frydenberg 1990) A probability distribution on a discrete and finite sample space with strictly positive density P, w.r.t. a product measure, satisfies P (see Definition LWF) over a PDAG G if and only if it factorizes as P(x) = P ( ) x τ x bd(τ), (2) τ T (G) where each factor P ( x τ x bd(τ) ) further factorizes along the undirected closure graph G m cl(τ).

49 Hyper Dirichlet priors A clique is a complete subgraph G A, such that there exists no (other) complete subgraph G A, A A V such that G A G A. A hyper-dirichlet density is defined for given real positive numbers λ c = {λ c,xc } xc X c let D(λ c ) for θ c by the density π (θ c λ c ) = Γ(λ 0) γ p θ λc,xc 1 c,x c (3) x c X c on the set { θ c x c X c θ c,xc = 1,θ c,xc > 0 }, where Γ(x) is the Euler gamma function, λ 0 = x c X c λ c,xc,γ p = x c X c Γ(λ c,xc ).

50 Let G be a PDAG, and let us set P (X θ,g) = r P τ T (G) l=1 ( ) x l τ θ τ,x l bd(τ), by (2), which is formally rewritten to include the parameters.

51 Next note that in (2) the probabilities P ( ) x τ θ τ,x bd(τ) satisfy the LWF Markov property P on the undirected closure graph G m cl(τ). Recall that a probability P(x θ) > 0 satisfying the property P on an undirected graph factorizes as c C P(x θ) = θ c,x c s S θ, s,x s where C {A A V } is the set of cliques and S {A A V } is the multiset (including repetitions of elements) of separators (for two cliques C i and C j the separator S = C i C j ), respectively. Thus ( ) r l=1 P x l τ θ τ,x l bd(τ) is a function of a product of expressions of the following form r θ(x l c) = l=1 θc,x nc,xc c, x c X c where { n c,xc } is the number of times the configuration x c occurs in r X c = x (l) c l=1.

52 On any clique c in G m cl(τ) we introduce the integration with respect to D(λ c,i ) as follows, c,x c π (θ c λ c ) dθ c. {θ c P θ nc,xc xc Xc θc,xc =1,θ c,xc >0} x c X c

53 By standard properties of the Dirichlet integral one gets from (3) that P c (X c ) = Γ(λ 0) Γ(n c,xc + λ c,xc), (4) Γ(r + λ) Γ(λ c,xc ) x c X c where r + λ = x c X c (n c,xc + λ c,xc ).

54 Hence, if C(τ) and S(τ) are the set of cliques and separators in the undirected graph G τ r ( ) P x l τ x l c C(τ) bd(τ) = P c (X c ) s S(τ) P s (X c ). (5) l=1 The probability P s (X c ) can be constructed for a separator by marginalizing over a clique that includes S. For the validity of this argument and the properties of the marginal data distribution, see (Dawid and Lauritzen 1993). The expression (5) must be multiplied over all the chain components τ in order to yield the overall marginal data distribution

55 P (X G) = τ T (G) c C(τ) P c (X c ) s S(τ) P s (X c ). (6) From (6) we define the final scoring function as the posterior probability P (G X) = P (X G)P (G) (7) P (X G)P (G). G S For a given set of nodes V let S = {LPDAG} be the search space. One particular goal for the topology learning can be stated as the identification of a structure G opt S having highest posterior probability (7), i.e. G opt arg max G S P (G X).

56 As noted by (Andersson et al 1997), the factors in (2) P ( x τ x bd(τ) ) satisfy also P on Gτ, i.e., the graph induced by the chain component 4 τ. When the so-called hyper-dirichlet distributions are used as priors the marginal data likelihood has the LWF Markov properties over graphs (Dawid and Lauritzen 1993). 4 the connected components of a graph with arrows removed

57 We need to construct an irreducible Markov chain searching the space. For PDAGs we observe the next result, which is Proposition 4.5 of (Andersson et al 1997). Theorem Consider two graphs G and H with same set of nodes in S = {LPDAG}. Then there exists a finite sequence of LPDAGs G G 1,...,G k H such that each consecutive pair G i,g i+1 differs by either (i) exactly one undirected edge, (ii) exactly by one directed edge, or (iii) by two edges that form an immorality.

58 Non-reversible MCMC An algorithm, based on parallell Markov chains G j = {G tj } with the transition kernels, with the probability of a transition from a current state S to a proposed (by the mechanism above) new state S, as ( ) min 1, p(x G ). p(x G) The chains draw at independent random times a value from P (G j X). This forces the chains to move to regions of maximal posterior probability.

59 The Parallel Search Define the sequence of probabilities {α t,t = 2,3,...} according to α t = 1 q log t, where q 1 can be chosen suitably, for instance q [5,10]. Z 0 = 0, and P(Z t = 1) = α t,p(z t = 0) = 1 α t, independently for t = 1,2,....

60 The Parallel Search For each t = 0, 1,..., define the distribution P t (G tj ) = p(x G tj) m j=1 p(x G t), over the space of the current states {G t1,g t2,...,g tm }. For each t = 0, 1,... such that Z t = 1, the transition to the next state is determined according to this distribution, such that the next state for each chain is sampled to the non-reversible proposal-acceptance formulae, independently for j = 1,..., m.

61 The Parallel Search For each t, such that Z t = 0, transition to the next state S (t+1)j is determined according to the non-reversible proposal-acceptance formulae above independently for j = 1,...,m.

62 Table: Prognostic factors in coronary heart disease. B yes no F E D C A no yes no yes neg < 3 < 140 no yes > 140 no yes > 3 < 140 no yes > 140 no yes pos < 3 < 140 no yes > 140 no yes > 3 < 140 no yes > 140 no yes

63 Table: Explanations of the Labels in Table 1 Label Meaning Range A smoking no,yes B strenuous mental work no,yes C strenuous physical work no,yes D systolic blood pressure < 140, > 140 E ratio of β and α lipoproteins < 3,> 3 F family anamnesis 5 of coronary heart disease no,yes 5 information concerning a medical patient and his/her background for use in analysis of her/his condition

64 The optimal equivalence class, having the estimated posterior probability.32 F B E C A D

65 The optimal equivalence class, having the estimated posterior probability.32 and the cliques {F }, {B,C}, {A,C,E}, {A,D,E}.

66 The method enables also consistent estimation of the marginal posterior probabilities of any edges being present in the graphs, that is we estimate the marginal probabilities as {G=(V,E) {UG} G S ˆP(e X) = P(G X) t,e E} {G {UG} G S P(G X). t} (c.f. M Koivisto: Advances in Exact Bayesian Structure Discovery in Bayesian Networks 2006.) These are illustrated in Table 3 where the probability of adjacency is given for all pairs of the considered variables.

67 Table: The estimated marginal posterior probabilities of edges for the coronary heart disease data. A B C D E F A B C D E F

68 Graph Structure Logarithm of the marginal data distribution Iterations

69 For the special case of decomposable UGs, (Frydenberg and Lauritzen 1989) shows that search operators adding and deleting of single edges in decomposable UGs is sufficient to guarantee irreducibility. Corollary The Markov chain G (i) with transition mechanism defined by the kernel, Algorithm and the search operator in the Algorithm is irreducible with respect to S ={UG}.

70 The example above F B E C A D P ψ BC ψ CAE ψ EAD ψ F

71 Joint work with: parts of the talk are based on a paper to appear in Data Mining and Knowledge Discovery, 2008, joint work with Jukka Corander and Magnus Ekdahl.