Learning Hierarchical Bayesian Networks for Large-Scale Data Analysis

Kyu-Baek Hwang 1, Byoung-Hee Kim 2, and Byoung-Tak Zhang 2

1 School of Computing, Soongsil University, Seoul 156-743, Korea
kbhwang@ssu.ac.kr
2 School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea
bhkim@bi.snu.ac.kr, btzhang@cse.snu.ac.kr

Abstract. Bayesian network learning is a useful tool for exploratory data analysis. However, applying Bayesian networks to the analysis of large-scale data, consisting of thousands of attributes, is not straightforward because of the heavy computational burden in learning and visualization. In this paper, we propose a novel method for large-scale data analysis based on hierarchical compression of information and constrained structural learning, i.e., hierarchical Bayesian networks (HBNs). An HBN can compactly visualize global probabilistic structure through a small number of hidden variables that approximately represent a large number of observed variables. An efficient learning algorithm for HBNs, which incrementally maximizes a lower bound of the likelihood function, is also suggested. The effectiveness of our method is demonstrated by experiments on synthetic large-scale Bayesian networks and a real-life microarray dataset.

1 Introduction

Owing to their ability to represent conditional independencies among variables, Bayesian networks have been applied to various data mining tasks [9], [4]. However, applying Bayesian networks to extremely large domains (e.g., a database consisting of thousands of attributes) remains a challenging task. The general approach to structural learning of Bayesian networks, greedy search, encounters the following problems when the number of variables exceeds several thousand. First, the running time required for structural learning becomes formidable. Moreover, greedy search is likely to be trapped in local optima because of the enlarged search space. Several researchers have suggested methods for alleviating these problems [5], [8], [6]. Even though these approaches have been shown to find reasonable solutions efficiently, they have the following two drawbacks. First, they are likely to spend a great deal of time learning local structure, which might be less important from the viewpoint of grasping the global structure.
The second problem concerns visualization: it is extremely hard to extract useful knowledge from a complex network structure consisting of thousands of vertices and edges.

In this paper, we propose a new method for large-scale data analysis using hierarchical Bayesian networks. It should be noted that introducing hierarchical structure into a model is a generic technique; several researchers have introduced hierarchy into probabilistic graphical modeling [10], [7]. Our approach differs from theirs in the purpose of the hierarchical modeling: our goal is to make it feasible to apply probabilistic graphical modeling to extremely large domains. We also propose an efficient learning algorithm for hierarchical Bayesian networks having many hidden variables.

The paper is organized as follows. In Section 2, we define the hierarchical Bayesian network (HBN) and describe its properties. The learning algorithm for HBNs is described in Section 3. In Section 4, we demonstrate the effectiveness of our method through experiments on various large-scale datasets. Finally, we draw conclusions in Section 5.

2 Definition of the Hierarchical Bayesian Network for Large-Scale Data Analysis

Assume that our problem domain is described by $n$ discrete variables, $\mathbf{Y} = \{Y_1, Y_2, \ldots, Y_n\}$ (see Footnote 1). The hierarchical Bayesian network for this domain is a special Bayesian network, consisting of $\mathbf{Y}$ and additional hidden variables. It assumes a layered hierarchical structure as follows. The bottom layer (observed layer) consists of the observed variables $\mathbf{Y}$. The first hidden layer consists of $\lceil n/2 \rceil$ hidden variables, $\mathbf{Z}_1 = \{Z_{11}, Z_{12}, \ldots, Z_{1\lceil n/2 \rceil}\}$. The second hidden layer consists of $\lceil \lceil n/2 \rceil / 2 \rceil$ hidden variables, $\mathbf{Z}_2 = \{Z_{21}, Z_{22}, \ldots, Z_{2\lceil \lceil n/2 \rceil / 2 \rceil}\}$. Finally, the top layer (the $\lceil \log_2 n \rceil$-th hidden layer) consists of only one hidden variable, $\mathbf{Z}_{\lceil \log_2 n \rceil} = \{Z_{\lceil \log_2 n \rceil 1}\}$. We denote all the hidden variables by $\mathbf{Z} = \{\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_{\lceil \log_2 n \rceil}\}$ (see Footnote 2). Hierarchical Bayesian networks, consisting of the variables $\{\mathbf{Y}, \mathbf{Z}\}$, have the following structural constraints.

1. Any parent of a variable must be in the same or the immediately upper layer.
2. At most one parent from the immediately upper layer is allowed for each variable.

Fig. 1 shows an example HBN consisting of eight observed and seven hidden variables. Under the above structural constraints, a hierarchical Bayesian network represents the joint probability distribution over $\{\mathbf{Y}, \mathbf{Z}\}$ as follows.

Footnote 1: In this paper, we represent a random variable by a capital letter (e.g., X, Y, and Z) and a set of variables by a boldface capital letter (e.g., X, Y, and Z). The corresponding lowercase letters denote an instantiation of the variable (e.g., x, y, and z) or of all members of the set of variables (e.g., x, y, and z), respectively.
Footnote 2: We assume that all hidden variables are also discrete.
Fig. 1. An example HBN structure (hidden layer 3: $Z_{31}$; hidden layer 2: $Z_{21}, Z_{22}$; hidden layer 1: $Z_{11}, \ldots, Z_{14}$; observed layer: $Y_1, \ldots, Y_8$). The bottom (observed) layer consists of eight variables describing the problem domain. Each hidden layer corresponds to a compressed representation of the observed layer.

$$P(\mathbf{Y}, \mathbf{Z}) = P(\mathbf{Z}_{\lceil \log_2 n \rceil}) \prod_{i=1}^{\lceil \log_2 n \rceil - 1} P(\mathbf{Z}_i \mid \mathbf{Z}_{i+1}) \cdot P(\mathbf{Y} \mid \mathbf{Z}_1). \quad (1)$$

An HBN is specified by the tuple $\langle S_h, \Theta_{S_h} \rangle$, where $S_h$ denotes the structure of the HBN and $\Theta_{S_h}$ denotes the set of parameters of the local probability distributions given $S_h$. In addition, we denote the parameters for each layer by $\{\Theta_{S_h}^{\mathbf{Y}}, \Theta_{S_h}^{\mathbf{Z}_1}, \ldots, \Theta_{S_h}^{\mathbf{Z}_{\lceil \log_2 n \rceil}}\}$ $(= \Theta_{S_h})$.

The hierarchical Bayesian network can represent hierarchical compression of the information contained in $\mathbf{Y} = \{Y_1, Y_2, \ldots, Y_n\}$. The number of hidden variables in the first hidden layer is roughly half of $n$. Under the structural constraints, each hidden variable in the first hidden layer, i.e., $Z_{1i}$ ($1 \le i \le \lceil n/2 \rceil$), can have two child nodes in the observed layer, as depicted in Fig. 1 (see Footnote 3). Each hidden variable then corresponds to a compressed representation of its own children whenever its number of possible values is smaller than the number of possible configurations of its children. In the hierarchical Bayesian network, edges between variables of the same layer are also allowed (e.g., see hidden layers 1 and 2 in Fig. 1). These edges encode the conditional (in)dependencies among the variables of that layer. The conditional independencies among the variables in a hidden layer correspond to a rough representation of the conditional independencies among the observed variables, because each hidden variable is a compressed representation of a set of observed variables. When we deal with a problem domain consisting of thousands of observed variables, the approximate probabilistic dependencies visualized through hidden variables can be a reasonable solution for exploratory data analysis.

Footnote 3: If $n$ is odd, all the hidden variables except one have two child nodes; the remaining one has only one child node.
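To make the layered construction above concrete, the following is a minimal Python sketch (ours, not the authors' implementation) of the layer sizes and a naive child-to-parent assignment satisfying the two structural constraints. The pairing here is by index order only; the learning algorithm in Section 3 instead chooses pairs by mutual information.

```python
import math

def hbn_layer_sizes(n):
    """Number of hidden variables in each hidden layer for n observed variables.

    Layer 1 has ceil(n/2) variables, layer 2 has ceil(ceil(n/2)/2), ...,
    and the top layer (layer ceil(log2 n)) has a single variable.
    """
    sizes = []
    m = n
    while m > 1:
        m = math.ceil(m / 2)
        sizes.append(m)
    return sizes  # e.g. n = 8 -> [4, 2, 1]

def naive_hierarchy(n):
    """Assign each variable of a layer at most one parent in the next layer.

    Children are paired in index order here; Section 3 pairs variables with
    high mutual information to limit information loss.
    """
    hierarchy = []
    sizes = [n] + hbn_layer_sizes(n)
    for lower, upper in zip(sizes, sizes[1:]):
        # parent j in the upper layer covers children 2j and 2j+1 (if present);
        # for odd layer sizes the last parent keeps a single child (Footnote 3)
        parents = [min(i // 2, upper - 1) for i in range(lower)]
        hierarchy.append(parents)
    return hierarchy

if __name__ == "__main__":
    print(hbn_layer_sizes(8))     # [4, 2, 1], as in Fig. 1
    print(naive_hierarchy(8)[0])  # [0, 0, 1, 1, 2, 2, 3, 3]
```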
3 Learning the Hierarchical Bayesian Network

Assume that we have a training dataset for $\mathbf{Y}$ consisting of $M$ examples, i.e., $D_{\mathbf{Y}} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M\}$. Our learning objective (the log-likelihood) can be written as follows.

$$L(\Theta_{S_h}, S_h) = \sum_{m=1}^{M} \log P(\mathbf{y}_m \mid \Theta_{S_h}, S_h) = \sum_{m=1}^{M} \log \sum_{\mathbf{Z}} P(\mathbf{Z}, \mathbf{y}_m \mid \Theta_{S_h}, S_h), \quad (2)$$

where $\sum_{\mathbf{Z}}$ denotes summation over all possible configurations of $\mathbf{Z}$. The general approach to finding a maximum-likelihood solution in the presence of missing variables, the expectation-maximization (EM) algorithm, is not directly applicable here because the number of missing variables amounts to several thousand, which renders the solution space intractable. Instead, we propose an efficient algorithm for maximizing a lower bound of Eqn. (2). The lower bound of the likelihood function is derived by Jensen's inequality as follows.

$$\sum_{m=1}^{M} \log \sum_{\mathbf{Z}} P(\mathbf{Z}, \mathbf{y}_m \mid \Theta_{S_h}, S_h) \ge \sum_{m=1}^{M} \sum_{\mathbf{Z}} \log P(\mathbf{Z}, \mathbf{y}_m \mid \Theta_{S_h}, S_h). \quad (3)$$

Further, the term for each example $\mathbf{y}_m$ in the above equation can be decomposed as follows.

$$\sum_{\mathbf{Z}} \log P(\mathbf{Z}, \mathbf{y}_m \mid \Theta_{S_h}, S_h) = \sum_{\mathbf{Z}} \log \left[ P(\mathbf{y}_m \mid \Theta_{S_h}^{\mathbf{Y}}, S_h) \, P(\mathbf{Z} \mid \mathbf{y}_m, \Theta_{S_h} \setminus \Theta_{S_h}^{\mathbf{Y}}, S_h) \right]$$
$$= C_0 \log P(\mathbf{y}_m \mid \Theta_{S_h}^{\mathbf{Y}}, S_h) + \sum_{\mathbf{Z}} \log P(\mathbf{Z} \mid \mathbf{y}_m, \Theta_{S_h} \setminus \Theta_{S_h}^{\mathbf{Y}}, S_h), \quad (4)$$

where $C_0$ is a constant that does not depend on the choice of $\Theta_{S_h}$ and $S_h$. In Eqn. (4), the parameter sets $\Theta_{S_h}^{\mathbf{Y}}$ and $\Theta_{S_h} \setminus \Theta_{S_h}^{\mathbf{Y}}$ can be learned separately given $S_h$ (see Footnote 4). Our algorithm starts by learning $\Theta_{S_h}^{\mathbf{Y}}$ and the substructure of $S_h$ related only to the parents of $\mathbf{Y}$. After that, we fill in the missing values for the variables in the first hidden layer, producing a training dataset for $\mathbf{Z}_1$ (see Footnote 5). Now, Eqn. (4) can be further decomposed as follows.

$$C_0 \log P(\mathbf{y}_m \mid \Theta_{S_h}^{\mathbf{Y}}, S_h) + \sum_{\mathbf{Z}} \log P(\mathbf{Z} \mid \mathbf{y}_m, \Theta_{S_h} \setminus \Theta_{S_h}^{\mathbf{Y}}, S_h)$$
$$= C_0 \log P(\mathbf{y}_m \mid \Theta_{S_h}^{\mathbf{Y}}, S_h) + C_1 \log P(\mathbf{z}_{1m} \mid \Theta_{S_h}^{\mathbf{Z}_1}, S_h) + \sum_{\mathbf{Z} \setminus \mathbf{Z}_1} \log P(\mathbf{Z} \setminus \mathbf{Z}_1 \mid \mathbf{y}_m, \mathbf{z}_{1m}, \Theta_{S_h} \setminus \{\Theta_{S_h}^{\mathbf{Y}}, \Theta_{S_h}^{\mathbf{Z}_1}\}, S_h), \quad (5)$$

Footnote 4: In this paper, the symbol $\setminus$ denotes set difference.
Footnote 5: Because the hidden variables are entirely unobserved, this procedure is prone to producing hidden constants when maximizing the likelihood function. We apply an encoding scheme to prevent this problem, which is described later.
where $C_1$ is a constant that does not affect the optimization. Then, we can learn $\Theta_{S_h}^{\mathbf{Z}_1}$ and the related substructure of $S_h$ by maximizing $\sum_{m=1}^{M} \log P(\mathbf{z}_{1m} \mid \Theta_{S_h}^{\mathbf{Z}_1}, S_h)$. In this way, we learn the hierarchical Bayesian network from bottom to top.

We propose a two-phase learning algorithm for hierarchical Bayesian networks based on the above decomposition. In the first phase, a hierarchy for information compression is learned. From the observed layer, we choose variable pairs that will share a common parent in the first hidden layer. Here, we select variable pairs with high mutual information in order to minimize information loss. After determining the parent of each variable pair, missing values for the hidden parent variable are filled in as follows. First, we estimate the joint probability distribution of the variable pair, $\hat{P}(Y_j, Y_k)$ ($1 \le j, k \le n$, $j \ne k$), from the given dataset $D_{\mathbf{Y}}$. Then, we set the parent variable's value to 0 for the most probable configuration of $\{Y_j, Y_k\}$ and to 1 for the second most probable one (see Footnote 6). By this encoding scheme, a parent variable represents the two most probable configurations of its child variables, minimizing the information loss. The parent variable's values for the other two configurations are treated as missing. Now, we can learn the parameters for the observed variables, $\Theta_{S_h}^{\mathbf{Y}}$, using the standard EM algorithm. After learning $\Theta_{S_h}^{\mathbf{Y}}$, we fill in the missing values by probabilistic inference, producing a complete dataset for the variables in the first hidden layer.

Footnote 6: Here, we assume that all variables are binary, although our method could be extended to more general cases.
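As a hedged illustration of the pairing-and-encoding step just described (our own sketch, not the authors' code), the following Python fragment greedily selects disjoint variable pairs in decreasing order of empirical mutual information and encodes the two most probable child configurations as 0 and 1, leaving the remaining configurations missing for the subsequent EM step. For thousands of variables the quadratic number of mutual-information evaluations would need pruning; this is only a sketch.

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def pair_by_mutual_information(data):
    """Greedily pick disjoint variable pairs in decreasing order of MI.

    data: (M, n) array of discrete values; returns the pairs and any leftover
    singleton variable (when n is odd).
    """
    n = data.shape[1]
    scored = sorted(
        ((mutual_information(data[:, j], data[:, k]), j, k)
         for j, k in combinations(range(n), 2)),
        reverse=True)
    used, pairs = set(), []
    for _, j, k in scored:
        if j not in used and k not in used:
            pairs.append((j, k))
            used.update((j, k))
    leftover = [j for j in range(n) if j not in used]
    return pairs, leftover

def encode_parent(child_pair):
    """Two-value encoding of a hidden parent for a child pair.

    child_pair: (M, 2) array. The most probable joint configuration maps to 0,
    the second most probable to 1, and all other configurations to missing.
    """
    configs, counts = np.unique(child_pair, axis=0, return_counts=True)
    order = np.argsort(-counts)
    top = configs[order[0]]
    second = configs[order[1]] if len(order) > 1 else None
    values = []
    for row in child_pair:
        if np.array_equal(row, top):
            values.append(0)
        elif second is not None and np.array_equal(row, second):
            values.append(1)
        else:
            values.append(None)  # filled in later by EM / probabilistic inference
    return values
```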
Now, the same procedure can be applied to $\mathbf{Z}_1$, learning the parameter set $\Theta_{S_h}^{\mathbf{Z}_1}$ and generating a complete dataset for $\mathbf{Z}_2$. This process is iterated, building the hierarchical structure and producing a complete dataset for all variables, $\{\mathbf{Y}, \mathbf{Z}\}$. After building the hierarchy, we learn the edges inside a layer when necessary (the second phase). Any structural learning algorithm for Bayesian networks can be employed, because a complete dataset is now available for the variables in each layer. Table 1 summarizes the two-phase algorithm for learning HBNs.

Table 1. Description of the two-phase learning algorithm for hierarchical Bayesian networks. Here, hidden layer 0 means the observed layer and the variable set $\mathbf{Z}_0$ means the observed variable set $\mathbf{Y}$.

Input: $D_{\mathbf{Y}} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M\}$ - the training dataset.
Output: A hierarchical Bayesian network $\langle S_h, \Theta_{S_h} \rangle$ maximizing the lower bound of $\sum_{m=1}^{M} \log P(\mathbf{y}_m \mid \Theta_{S_h}, S_h)$.

The First Phase
- For $l = 0$ to $\lceil \log_2 n \rceil - 1$:
  - Estimate the mutual information between all possible variable pairs, $I(Z_{li}; Z_{lj})$, in hidden layer $l$.
  - Sort the variable pairs in decreasing order of mutual information.
  - Select $\lceil n / 2^{l+1} \rceil$ variable pairs from the sorted list such that each variable is included in only one variable pair.
  - Set each variable in hidden layer $(l+1)$ as the parent of a selected variable pair.
  - Learn the parameter set $\Theta_{S_h}^{\mathbf{Z}_l}$ by maximizing $\sum_{m=1}^{M} \log P(\mathbf{z}_{lm} \mid \Theta_{S_h}^{\mathbf{Z}_l}, S_h)$.
  - Generate the dataset $D_{\mathbf{Z}_{l+1}}$ based on the current HBN.

The Second Phase
- Learn the Bayesian network structure inside hidden layer $l$ by maximizing $\sum_{m=1}^{M} \log P(\mathbf{z}_{lm} \mid \Theta_{S_h}^{\mathbf{Z}_l}, S_h)$.

4 Experimental Evaluation

4.1 Results on Synthetic Datasets

To simulate diverse situations, we experimented with datasets generated from various large-scale Bayesian networks having different structural properties. They are categorized into scale-free [2] and modular structures. All variables were binary, and the local probability distributions were randomly generated. Here, we show the results on a scale-free and a modular Bayesian network, each consisting of 5000 nodes (see Footnote 7). They are shown in Fig. 2(a) and 2(b), respectively. Training datasets of 1000 examples were generated from them. The first phase of the HBN learning algorithm was applied to the training datasets, building hierarchies. Then, the second phase was applied to the seventh hidden layer, consisting of 40 hidden variables. The learned Bayesian network structures inside the seventh hidden layer are shown in Fig. 2(c) and 2(d).

We examined the quality of information compression. Fig. 3(a) shows the mutual information between parent nodes in the upper layer and child nodes in the lower layer, averaged over each hidden layer (see Footnote 8). Here, we can observe that the amount of shared information between consecutive layers is more than 50% (see Footnote 9), although it decreases as the hidden layer level increases. Interestingly, the hierarchical Bayesian network preserves more information in the case of the dataset from the modular Bayesian network. To investigate this further, we estimated the distribution of the mutual information (see Fig. 3(b)). Here, we can clearly observe that more information is shared between parent and child nodes in the case of the modular Bayesian network than in the case of the scale-free Bayesian network. Based on the above experimental results, we conclude that the hierarchical Bayesian network can efficiently represent the complicated information contained in a large number of variables. In addition, HBNs are more appropriate when the true probability distribution assumes a modular structure. We conjecture that this is because a module in the lower layer can be well represented by a hidden node in the upper layer in our HBN framework.

Footnote 7: Results on other Bayesian networks were similar, although not shown here.
Footnote 8: Here, mutual information values were scaled into [0, 1] by dividing by the minimum of the two variables' entropies.
Footnote 9: That is, the scaled mutual information value is greater than 0.5.
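The scaled mutual information used in Fig. 3 (see Footnote 8) can be computed as in the following sketch. This is our own illustration, assuming the parent and child variables are available as discrete NumPy columns; it is not the authors' evaluation code.

```python
import numpy as np

def entropy(x):
    """Shannon entropy (in nats) of a discrete column."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def scaled_mutual_information(parent, child):
    """Mutual information scaled into [0, 1] by the smaller marginal entropy,
    as described in Footnote 8 for the evaluation in Fig. 3."""
    mi = 0.0
    for a in np.unique(parent):
        for b in np.unique(child):
            pab = np.mean((parent == a) & (child == b))
            if pab > 0:
                mi += pab * np.log(
                    pab / (np.mean(parent == a) * np.mean(child == b)))
    denom = min(entropy(parent), entropy(child))
    return mi / denom if denom > 0 else 0.0
```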
Fig. 2. Scale-free (a) and modular (b) Bayesian networks, consisting of 5000 nodes, which generated the training datasets. The Bayesian network structures inside the seventh hidden layer (nodes H1-H40), learned from the training datasets generated from the scale-free (c) and modular (d) Bayesian networks. These network structures were drawn with the Pajek software [3].

4.2 Results on a Real-Life Microarray Dataset

A real-life microarray dataset on the budding yeast cell cycle [11] was analyzed with hierarchical Bayesian networks. The dataset consists of 6178 genes and 69 samples. We binarized the gene expression levels based on the median expression of each slide sample. Among the 6178 genes, we excluded genes with low information entropy (< 0.8). Finally, we analyzed a binary dataset consisting of 6120 variables and 69 samples. The general tendency with respect to information compression was similar to the case of the synthetic datasets, although not shown here. In the seventh hidden layer, consisting of 48 variables, we learned a Bayesian network structure (see Fig. 4). This Bayesian network compactly visualizes the original network structure consisting of 6120 genes. From the network structure, we can easily find a set of hub nodes, e.g., H2, H4, H8, H28, and H30. The genes comprising these hub nodes and their biological roles were analyzed.
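The preprocessing described at the beginning of this subsection (binarizing each slide sample at its median and filtering out low-entropy genes) might be sketched as follows. This is our own illustration; in particular, it assumes the entropy threshold of 0.8 is measured in bits, which the paper does not state explicitly.

```python
import numpy as np

def preprocess_expression(X, entropy_threshold=0.8):
    """Binarize each sample (column) at its median and drop low-entropy genes.

    X: (genes, samples) real-valued expression matrix.
    Returns the binary matrix restricted to the retained genes and the
    boolean mask of retained genes.
    """
    # 1 if a gene is above the median expression of that slide sample, else 0
    binary = (X > np.median(X, axis=0, keepdims=True)).astype(int)

    def entropy_bits(row):
        p1 = row.mean()
        if p1 in (0.0, 1.0):
            return 0.0
        return -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

    keep = np.array([entropy_bits(row) >= entropy_threshold for row in binary])
    return binary[keep], keep
```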
Fig. 3. Quality of information compression by hierarchical Bayesian network learning: (a) mutual information between parent nodes in the upper layer and child nodes in the lower layer, averaged per hidden layer, and (b) its distribution, for the scale-free and modular Bayesian networks.

Fig. 4. The Bayesian network structure in the seventh hidden layer, consisting of 48 variables (H1-H48). This network structure approximately represents the original network structure consisting of 6120 yeast genes. Here, we can easily find some hub nodes, for example, H2, H4, H8, H28, and H30. The network structure was drawn with the Pajek software [3].

The function of genes can be described by Gene Ontology (GO) [1] annotations. GO maps each gene or gene product to directly related GO terms, which fall into three categories: biological process (BP), cellular component (CC), and molecular function (MF). We can conjecture the meaning of each hub using this annotation, focusing on the BP terms related to the cell cycle. For this task, we used GO Term Finder (http://db.yeastgenome.org/cgi-bin/go/gotermfinder), which looks for significantly shared GO terms that are directly or indirectly related to a given list of genes. The results are summarized in Table 2.

Table 2. Gene function annotation of hub nodes in the learned Bayesian network consisting of 48 variables. The significance of a GO term was evaluated by examining the proportion of the genes associated with the term, compared to the number of times the term is associated with other genes in the genome (p-values were calculated by a binomial distribution approximation).

Node name  GO term                                 Frequency  p-value
H2         Organelle organization and biogenesis   18.7%      0.09185
H4         Cellular physiological process          74.0%      0.01579
H8         Cellular physiological process          74.2%      0.01357
H28        Response to stimulus                    14.8%      0.00514
H30        Cell organization and biogenesis        29.3%      0.02804
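The caption of Table 2 states that the p-values come from a binomial approximation; the paper does not give the exact formula used by GO Term Finder. A standard binomial enrichment test of this kind might look like the following sketch, where the counts in the usage example are hypothetical and not the paper's data.

```python
from math import comb  # Python 3.8+

def binomial_enrichment_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Binomial(n, K/N): the probability of seeing at least
    k annotated genes among n drawn, when a fraction K/N of the N genome genes
    carry the GO term. This is a binomial approximation to the hypergeometric
    enrichment test."""
    p = K / N
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical counts: 20 of 135 hub genes annotated to a term
# carried by 600 of 6120 genome genes.
print(binomial_enrichment_pvalue(20, 135, 600, 6120))
```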
The closely located hub nodes H4 and H8 (see Fig. 4) share the function of cellular physiological process. The genes in H30 share a more specific function than those of H4 and H8, namely cell organization and biogenesis. The hub node H2 is related to organelle organization and biogenesis, which is more specific still than that of H30. The genes in H28, the most crowded hub in the network structure, respond to stress or stimuli such as nitrogen starvation.

5 Conclusion

We proposed a new class of Bayesian networks for analyzing large-scale data consisting of thousands of variables. The hierarchical Bayesian network is based on hierarchical compression of information and constrained structural learning. Through experiments on datasets from synthetic Bayesian networks, we demonstrated the effectiveness of hierarchical Bayesian network learning with respect to information compression. Interestingly, the degree of information conservation was affected by the structural properties of the Bayesian networks that generated the datasets: the hierarchical Bayesian network preserved more information in the case of modular networks than in the case of scale-free networks. One explanation for this phenomenon is that our HBN method is more suitable for modular networks because a variable in the upper layer can well represent a set of variables contained in a module in the lower layer. Our method was also applied to the analysis of real microarray data on the yeast cell cycle, consisting of 6120 genes. We obtained a reasonable approximation, consisting of 48 variables, of the global gene expression network. A hub node in the learned Bayesian network consisted of genes with similar functions. Moreover, neighboring hub nodes also shared similar functions, confirming the effectiveness of our HBN method for real-life large-scale data analysis.

Acknowledgements

This work was supported by the Soongsil University Research Fund and by the Korea Ministry of Science and Technology under the NRL program.
References

1. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene Ontology: tool for the unification of biology. Nature Genetics 25(1) (2000) 25-29
2. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439) (1999) 509-512
3. Batagelj, V., Mrvar, A.: Pajek - program for large network analysis. Connections 21(2) (1998) 47-57
4. Friedman, N.: Inferring cellular networks using probabilistic graphical models. Science 303 (2004) 799-805
5. Friedman, N., Nachman, I., Pe'er, D.: Learning Bayesian network structure from massive datasets: the sparse candidate algorithm. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI) (1999) 206-215
6. Goldenberg, A., Moore, A.: Tractable learning of large Bayes net structures from sparse data. Proceedings of the Twenty-First International Conference on Machine Learning (ICML) (2004)
7. Gyftodimos, E., Flach, P.: Hierarchical Bayesian networks: an approach to classification and learning for structured data. Lecture Notes in Artificial Intelligence 3025 (2004) 291-300
8. Hwang, K.-B., Lee, J.W., Chung, S.-W., Zhang, B.-T.: Construction of large-scale Bayesian networks by local to global search. Lecture Notes in Artificial Intelligence 2417 (2002) 375-384
9. Nikovski, D.: Constructing Bayesian networks for medical diagnosis from incomplete and partially correct statistics. IEEE Transactions on Knowledge and Data Engineering 12(4) (2000) 509-516
10. Park, S., Aggarwal, J.K.: Recognition of two-person interactions using a hierarchical Bayesian network. Proceedings of the First ACM SIGMM International Workshop on Video Surveillance (IWVS) (2003) 65-76
11. Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9(12) (1998) 3273-3297