Gene duplication and hierarchical modularity in intracellular interaction networks


BioSystems 74 (2004)

Jennifer Hallinan
ARC Centre for Bioinformatics, Institute for Molecular Biosciences, The University of Queensland, Brisbane 4072, Qld, Australia
Corresponding author. E-mail address: j.hallinan@imb.uq.edu.au (J. Hallinan).

Received 13 June 2003; received in revised form 29 January 2004; accepted 2 February 2004

Abstract

Networks of interactions evolve in many different domains. They tend to have topological characteristics in common, possibly due to common factors in the way the networks grow and develop. It has been recently suggested that one such common characteristic is the presence of a hierarchically modular organization. In this paper, we describe a new algorithm for the detection and quantification of hierarchical modularity, and demonstrate that the yeast protein-protein interaction network does have a hierarchically modular organization. We further show that such organization is evident in artificial networks produced by computational evolution using a gene duplication operator, but not in those developing via preferential attachment of new nodes to highly connected existing nodes.

© 2004 Elsevier Ireland Ltd. All rights reserved.

Keywords: Modularity; Hierarchical; Network; Intracellular; Yeast; Gene duplication

1. Introduction

Networks of interactions between agents arise in a wide variety of contexts, including social networks (Newman, 2001), the Internet (Albert et al., 1999; Huberman and Adamic, 1999) and the world wide web (Kleinberg and Lawrence, 2002; Flake et al., 2002), ecological networks (Williams and Martinez, 2000), and intracellular interaction networks (Bhalla and Iyengar, 1999; Uetz et al., 2000; Sole and Pastor-Santorros, 2002; Jeong et al., 2000). Analysis reveals that these diverse networks frequently have topological and dynamic features in common, and it has been suggested that these commonalities arise
from similar processes operating during the evolution and development of the networks.

Topological characteristics which are common to many naturally occurring networks include a scale-free pattern of connectivity. Unlike a randomly constructed network, a scale-free network has no characteristic number of connections per node; the probability P(k) of finding a node with k connections follows a power law:

$$P(k) \sim k^{-\gamma}, \qquad (1)$$

where the scaling exponent, γ, varies with the degree distribution of the network. When the degree, k, of the nodes of a scale-free network is plotted against the probability of occurrence of that degree, P(k), on a log-log scale the data form a straight line, the slope of which is -γ.

Interaction networks may also exhibit small-world properties (Watts, 1999). A small-world network has a

small but significant number of short-cut connections between otherwise widely separated nodes. This organization leads to characteristic topological features, including a small diameter, where diameter is defined as the longest of the shortest paths between every pair of nodes in the network. Small-world networks also have a large average cluster coefficient, C, compared with randomly connected networks with the same number of nodes and links. The cluster coefficient is a measure of the extent to which the neighbors of a node are linked to each other:

$$C = \frac{1}{n} \sum_{i=1}^{n} \frac{C_i}{N_i (N_i - 1)/2}, \qquad (2)$$

where n is the number of nodes in the network, $C_i$ is the number of connections between neighbors of node i, and $N_i$ is the number of neighbors of node i (Watts, 1999).

Many networks further appear to be organized into a number of modules. A module is generally defined as a subnetwork of a graph, the nodes of which have more connections to other nodes within the module than to external nodes (see, for example, Ancel and Fontana, 2002; Calabretta et al., 1998; Csete and Doyle, 2002; Rives and Galitski, 2003). The identification of modules within a network is an NP-complete problem (Flake et al., 2002).

In practice, a number of algorithms have been used for the identification of modules in networks. One approach involves the analysis of flux modes (the smallest subnetworks enabling the metabolic system to operate in steady state) within the network. Many polynomial time algorithms exist for finding the maximum flow that can be routed from a source node to a sink node while obeying all capacity constraints (Flake et al., 2002; Stelling et al., 2002). This approach requires a seed node with which to initialize the algorithm.

Another approach to module detection relies upon the identification of nodes or links which lie between modules. Snel et al.
(2002) define such linkers as orthologous groups with mutually exclusive associations, and split the network at the linkers to produce modules which appear to be biologically plausible. Similarly, Schuster et al. (2002) split the network at nodes which have more than a threshold number of links, on the contention that such highly connected hubs must be external to the modules. They used a threshold of four links, but point out that other values may be useful, depending upon the size of the subnets produced.

Module identification can be approached as a form of cluster analysis. Hierarchical clustering algorithms are widely used, even by researchers who are not interested in the cluster tree itself. The cluster tree is simply thresholded at an arbitrary depth in order to determine the final clusters (Ravasz and Barabasi, 2003). Girvan and Newman (2002) produced a cluster tree by iteratively removing the links with the highest betweenness (Freeman, 1977). An interesting approach was taken by Holme et al. (2002), who combined the node-removal approach with the betweenness measure to develop an algorithm in which nodes of high betweenness are iteratively removed to deconstruct the network.

It has recently been suggested that in addition to a modular organization, biological networks tend to have a hierarchical structure, in which nodes are organized into small modules which are, in turn, organized into larger modules, and so on (Rives and Galitski, 2003). These authors propose a method for the identification of hierarchical modularity which does not require the identification of individual modules. They derive a scaling law for the connectivity of nodes in a hierarchically modular network,

$$C(k) \sim k^{-1},$$

where C(k) is the cluster coefficient defined in Eq. (2), taken over nodes with k connections. Networks whose C(k) distributions fit this curve are held to be hierarchically modular. Ravasz et al.
(2002) have identified hierarchical modularity in the metabolic networks of 43 different organisms. While the scaling law provides a simple means of identifying hierarchical modularity in a network, it offers no insight into the form of that modularity and, hence, does not contribute to a detailed analysis of network structure.

All of the algorithms discussed above rely upon user judgement, either to choose the threshold at which the network is fragmented or to validate the biological plausibility of the modules. Since the algorithms are topology-based, an objective, topology-based measure of the goodness of a module would be a valuable addition to the module detection algorithms. In this paper we describe a new algorithm for the

detection of modularity, in conjunction with an objective, topology-based measure of the coherence of the modules detected. These tools are combined to produce a coherence profile which can be used to visualize the extent of hierarchical modularity of a network, compare the modular structure of networks, and identify the threshold at which a network has maximum modular coherence.

The major evolutionary operators which have been implicated in the evolution of scale-free networks are the preferential attachment of new nodes to highly connected existing nodes (Albert and Barabasi, 2000) and the noisy duplication of existing nodes ("gene duplication"; Pastor-Satorras et al., 2002). The relative importance of these operators to the development of real networks is unclear, and probably differs from network to network. Although both of these operators have been demonstrated to produce a scale-free pattern of connectivity, their effect upon the modularity of the network topology has not previously been investigated. We use the coherence profile algorithm to examine networks evolved according to several different published algorithms and compare the modularity of the resulting networks with that of the best-characterized biological network, the protein-protein interaction network of the yeast Saccharomyces cerevisiae.

2. Network generation

2.1. The yeast protein-protein interaction network

Probably the best-characterized subcellular interaction network is the protein-protein interaction network of the baker's yeast, S. cerevisiae. High-throughput methods for the collection of yeast protein-protein interaction data have been developed over the last 5 years (Fields and Song, 1989), and large interaction databases exist on the Web. The data in these databases are known to be both noisy and incomplete (von Meering et al., 2002).
Both false negatives (interactions which exist in vivo, but have not been picked up by the screens) and false positives (interactions which occur under the particular conditions of a yeast two-hybrid screen, but not otherwise) will occur, to an unknown extent. Further, the network which can be constructed from the yeast two-hybrid data is a static snapshot of interactions, with none of the dynamic, temporal qualities of the network in the living cell. These problems mean that considerable care must be taken both to choose the most reliable data with which to work and not to over-interpret the results of work done using yeast two-hybrid interaction data.

In an effort to use only the most reliable data available, the dataset used for these experiments is the core set of S. cerevisiae protein-protein interactions identified by Deane et al. (2002) from the Database of Interacting Proteins (DIP database). This data is a subset of the entire DIP database consisting of those interactions which the authors verified using two forms of computational assessment, and is, therefore, less likely to contain false positive relationships than is the DIP database as a whole, although false negatives (missing interactions) undoubtedly occur.

The core dataset comprises 3003 interactions between 1788 proteins (an average connectivity of 1.7). It does not form a single connected component, however; there are 139 components, of which the largest has 1471 proteins and 2770 interactions (average connectivity 1.9). This largest connected component was used for all investigations (Fig. 1).

2.2. Network models

Since we are interested in the evolution of biological interaction networks, the yeast protein-protein interaction network was used as the gold standard network for these experiments. In order to compare the effects of different evolutionary operators, we used different operators to generate networks with the same general characteristics as the yeast network.
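The reduction of the core dataset to its largest connected component (Section 2.1) can be illustrated with a short sketch. This is not the procedure used in the paper, only a minimal breadth-first search under stated assumptions: the function names `largest_component` and `average_connectivity` are hypothetical, and interactions are assumed to be given as an undirected edge list. The reported average connectivities (1.7 and 1.9) are reproduced when connectivity is taken as the number of interactions divided by the number of proteins, which is the convention assumed here.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return (nodes, edges) of the largest connected component of an
    undirected graph given as a list of (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component

    kept = [(u, v) for u, v in edges if u in best and v in best]
    return best, kept


def average_connectivity(n_nodes, n_edges):
    """ASSUMPTION: connectivity is edges per node, which matches the
    values quoted in the text for the yeast data."""
    return n_edges / n_nodes


# Toy example: a triangle plus a separate pair of interacting proteins.
toy = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]
nodes, kept = largest_component(toy)
print(sorted(nodes), len(kept))
print(round(average_connectivity(1788, 3003), 1))  # whole core dataset
print(round(average_connectivity(1471, 2770), 1))  # largest component
```

Under this convention the whole core dataset gives 3003/1788 and the largest component 2770/1471, which round to the values quoted above.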
The yeast network, and probably many other biological interaction networks, have three major characteristics:

1. A power law connectivity with a well-defined cutoff. The distribution of connectivity within the network follows a power law. A truly scale-free network obeys this distribution over a wide range of connectivities. Naturally occurring networks, however, tend to deviate from the power law at the extremes of the distribution, probably because of physical factors affecting nodes: people form

relatively fewer new relationships as they age; proteins have physical limitations to the number of binding sites they can support (Amaral et al., 2000), and so on. The yeast network displays such a cutoff at the tail of the distribution.

2. Sparse average connectivity. Although the range of connectivities is wide, most naturally occurring networks have a low average connectivity. The core yeast network is similarly sparse (average connectivity 1.7, or 1.9 for its largest connected component).

3. Small-world characteristics. Small-world networks are characterized by a small diameter relative to the number of nodes in the network, and a large cluster coefficient in comparison with that of a randomly connected network of the same size and average connectivity.

Fig. 1. The largest connected component of the curated yeast dataset. In this diagram the circles represent proteins and the lines represent interactions between proteins.

We generated networks with characteristics as close as possible to the size and average connectivity of the yeast protein-protein interaction network, using two published algorithms which have been demonstrated to produce scale-free networks: gene duplication (Pastor-Satorras et al., 2002) and preferential attachment (Ravasz et al., 2002). In addition, we generated randomly connected networks with approximately the same size and average connectivity as the yeast network. Five networks were generated using each algorithm. The size, average connectivity, average diameter, and average cluster coefficient of these networks are described in Table 1.

2.3. The preferential attachment model

Scale-free networks were generated using the algorithm described by Albert and Barabasi (2000).
In this algorithm, a network grows by the addition of new nodes, each of which is linked to an existing node i with probability Π proportional to the connectivity $k_i$ of that node:

$$\Pi(k_i) = \frac{k_i + 1}{\sum_j (k_j + 1)}. \qquad (3)$$

Albert and Barabasi's model produces scale-free networks only for a subset of possible values of the parameters p and q (see Albert and Barabasi, 2000, for a full analysis of the behavior of the algorithm). The network analysis program Pajek (Batagelj and Mrvar, 1998) incorporates an implementation of Albert and Barabasi's algorithm, with default parameter values which will produce a scale-free network. Starting from these defaults ($m_0$, the initial number of nodes, = 3; m, the number of nodes added at each time step, = 2; p, the probability of adding a link; and q, the probability of rewiring a link), we iteratively modified the parameter values until we obtained scale-free networks which also had an average connectivity as close as possible

to that of the yeast network. The final parameters used were $m_0$ = 2, m = 1, p = 0.333, q = . Fig. 2 shows a typical example of a network grown using the preferential attachment algorithm.

Fig. 2. The scale-free network generated using the preferential attachment algorithm.

Table 1. Characteristics of the networks used in the project. For each network type (yeast, random, preferential attachment, gene duplication) the table reports the mean and S.D. of the number of nodes, number of edges, connectivity, diameter, and cluster coefficient. All values are averaged over five networks, except for the yeast network, which is the largest connected component of the core yeast protein-protein interaction network (connectivity 1.88; S.D. not applicable).

2.4. Gene duplication

Gene duplication has been an important factor in the evolution of many organisms (Lynch, 2002). We used a network generation algorithm based upon gene duplication, in which a gene is interpreted as a node in the network (Pastor-Satorras et al., 2002). At each time step a node is selected at random and duplicated, together with all of its links to other nodes. Links associated with the new node are then added with probability α, or deleted with probability δ.

The gene duplication model tends to generate networks with a large number of single, unconnected nodes. The average connectivity of the largest connected component of the network is, therefore, considerably higher than the value for the network as a whole. In addition, the highly stochastic nature of the algorithm means that the average connectivity of the network varies considerably from run to run of the algorithm, particularly when generating a relatively small network. Table 2 summarizes the results of 100 runs of the algorithm with the parameters described in Pastor-Satorras et al. (2002).

The average connectivity of the largest connected component in a gene duplication network is dependent upon the link deletion parameter δ. It proved

impossible to find a value of δ which would produce a giant component corresponding to that of the yeast network. A very high value of δ yields a highly fragmented network with no single large connected component, while lower values of δ produced larger components with connectivity somewhat higher than that of the yeast network. A value of δ (0.75) which would result in a giant component with average connectivity of around 2.5 (within the range of average connectivities reported for real networks) was selected empirically. In order to generate a largest connected component of a useful size, networks of 10,000 nodes were generated and the largest connected component extracted. Fig. 3 shows the largest connected component of a typical network generated using the gene duplication algorithm.

Table 2. Node and link statistics for the whole network and for the largest connected component of the same network, averaged over 100 runs of the gene duplication algorithm with δ = 0.562, N = 2000 and k = 2.5. Columns: nodes, links, and average connectivity (mean and S.D. for each). Rows: whole network and largest connected component.

2.5. Random networks

Control networks were generated with the appropriate numbers of nodes and links, connected at random so that the resulting network has approximately the same average connectivity as the yeast network. These random networks do not have scale-free connectivity, and would not be expected to display any significant modularity. An example of a random network is shown in Fig. 4.

3. Quantifying hierarchical modularity

3.1. Iterative vector diffusion

The iterative vector diffusion algorithm operates in the context of a graph G consisting of a vertex set $V(G) = \{v_1, \ldots, v_n\}$ and an edge set $E(G) = \{e_1, \ldots, e_m\}$, where each edge consists of two vertices. The algorithm is initialized by assigning to each vertex a binary vector of length n, initialized to

$$v_{i,j} = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases}$$

where i is an index into the vector and j is the unique number assigned to a given node.
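The initialization above, together with the pairwise update and stopping rule described in the remainder of this section, can be sketched in a few lines. This is an illustrative reading rather than the author's implementation: the function names are hypothetical, vertices are assumed to be numbered 0 to n-1, and the exact update rule is an assumption (each element of the two vectors is moved towards the other by at most δ, and never past the midpoint).

```python
import math
import random


def init_vectors(n):
    """One length-n vector per vertex: 1 at the vertex's own index, 0 elsewhere."""
    return [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]


def diffuse(edges, n, delta=0.01, alpha=0.1, seed=0):
    """Iterative vector diffusion over an edge list on vertices 0..n-1."""
    rng = random.Random(seed)
    vectors = init_vectors(n)
    # Stopping criterion from the text: (alpha / delta) * number of connections.
    iterations = round(alpha / delta * len(edges))
    for _ in range(iterations):
        u, v = rng.choice(edges)  # pick one edge at random
        for i in range(n):
            gap = vectors[v][i] - vectors[u][i]
            # ASSUMPTION: move both endpoints towards each other by delta,
            # capped at half the gap so the two values never cross.
            step = math.copysign(min(delta, abs(gap) / 2.0), gap)
            vectors[u][i] += step
            vectors[v][i] -= step
    return vectors


def distance(a, b):
    """Euclidean distance between two vectors (the paper's clustering step
    uses a correlation-based metric instead; this is only for inspection)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Vertices 0 and 1 share an edge; vertex 2 is isolated.
vectors = diffuse([(0, 1)], n=3)
print(distance(vectors[0], vectors[1]) < distance(vectors[0], vectors[2]))
```

In this toy graph the vectors of the two linked vertices drift together while the isolated vertex keeps its original vector, which is the property the subsequent clustering step relies on.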
This generates an initial set of n orthogonal vectors. The algorithm proceeds iteratively. At each iteration an edge from the network is selected at random and the vectors associated with each of its nodes are moved towards each other by adding a small amount, δ, to each element of the vector. (This δ is a step size, and is distinct from the link deletion probability of the gene duplication model.)

Fig. 3. A network generated using the gene duplication algorithm. There are 943 nodes and 2500 links.

This vector diffusion process is iterated until a stopping criterion is met. We chose a maximum number of iterations as the stopping criterion. This number, n, is dependent upon both the number of connections in the network, c, and the size of δ, such that

$$n = \left(\frac{\alpha}{\delta}\right) c,$$

where α is the average amount by which a vector is changed in the course of the run. A value for α of

0.1 was selected empirically in trials on artificially generated networks. At the end of the vector diffusion process the vectors, initially mutually orthogonal, are clustered in n-dimensional space. To reduce the dimensionality of the data set, the vectors are then subjected to hierarchical clustering using the algorithm implemented by Eisen et al. (1998). This algorithm uses the Pearson correlation coefficient as a distance metric. It calculates the distance matrix for all members of the input set of vectors and then uses an agglomerative hierarchical algorithm to create a hierarchical cluster tree, in which the two closest items in the set are joined by a node of the tree, and the two items replaced by a single item representing the new node. The process iterates until only one item remains.

Fig. 4. A random network with 1432 nodes and 2770 edges.

3.2. Cluster thresholding

The output of the cluster algorithm is a binary tree, with a single root node giving rise to two offspring nodes, each of which gives rise to two child nodes of its own, and so on. The tree can, therefore, be thresholded at various levels (two parents, four parents, eight parents, etc.; see Fig. 5) and the modularity of the network at each level can be examined.

Fig. 5. Thresholding a cluster tree. (a) Tree thresholded at parent level 2 produces two clusters, (b) the same tree thresholded at parent level 3 has four clusters.

3.3. Modular coherence

The problem of identifying modules in a network is essentially an unsupervised cluster analysis task. Nodes are identified as belonging to a given module, or cluster, on the basis of their closeness to other nodes as assessed using an appropriate metric. There are many cluster analysis algorithms, several of which have been applied to the module detection task, as discussed above. Most clustering algorithms, however, will identify clusters in any dataset, whether or not these have any correspondence to real groupings in the dataset. In order to validate the output of a clustering algorithm, practitioners often examine measures such as inter- and intra-cluster variance. Such measures are not easily applied to nodes on a graph. We propose a measure of modular coherence, which measures the relative proportions of inter- and intra-module links and assigns a value in the range -1 (no coherence) to +1 (a fully connected, stand-alone subgraph).

The coherence, χ, of a previously identified module can be defined as

$$\chi = \frac{2 k_i}{n(n-1)} - \frac{1}{n} \sum_{j=1}^{n} \frac{k_{ji}}{k_{jo} + k_{ji}}, \qquad (4)$$

where $k_i$ is the total number of edges between nodes in the module, n is the number of nodes in the module, $k_{ji}$ is the number of edges between node j and other nodes within the module, and $k_{jo}$ is the number of edges between node j and other nodes outside the module. The first term in this equation is simply the proportion of possible links between the nodes comprising the module which actually exist; a measure of the connectivity within the module. The second term is the average proportion of edges per node which are internal to the module. A highly connected node with few external edges will, therefore, have a lower value of χ than a highly connected node with many external edges. χ will have a value in the range (-1, +1).

The concept of modularity in a network leads naturally to the question of scale. At what scale should modularity be sought? It is important that any characteristic scale for modularity in the network arise from the data, rather than being imposed by the investigator, since the appropriate scale cannot be determined a priori. This consideration has led to the concept of hierarchical modularity: the idea that network modules can occur at a range of scales, with modules higher up the hierarchy divided into smaller modules, and so on. Holme et al.
(2002) consider a fundamental question in biological network analysis to be what the hierarchical organization of subnetworks looks like.

Rather than making an a priori decision about the scale at which network modularity should be analyzed, we propose an approach which provides an overview of the degree of modularity present in a given network at every possible scale of modularity. The resulting graph facilitates visual inspection of the modularity of the network over all possible scales, and permits the selection of a specific characteristic scale of the network for further analysis, if required. We call this approach a coherence profile.

At each level in the hierarchy the number of modules and the average modular coherence of the network were computed. Average coherence was then plotted against threshold level to produce the coherence profile summarizing the hierarchical modularity of the network.

4. Results

The coherence profile for the yeast network is shown in Fig. 6. It is immediately apparent that the yeast network has significant positive modular coherence over most of the range of thresholds. At low threshold values, corresponding to a partitioning of the network into a small number of relatively large modules, average coherence dips below 0, indicating that the modules have more external than internal connectivity. This is because a clustering algorithm is part of the module identification algorithm. Any clustering algorithm will identify clusters in whatever data it is given, whether or not they reflect real modules in the biological network. Only when the measured modular coherence is positive can confidence be placed in the biological reality of the modules. The spurious nature of the results at high threshold levels is also indicated by the sudden increase in standard deviation at the point where modular coherence drops below zero. These modules are illusory.
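The sign convention used above (positive coherence for modules with mostly internal edges, negative coherence for modules with mostly external edges) can be made concrete with a small sketch of a coherence score. It is a hedged illustration, not the author's code: the function name `modular_coherence` is hypothetical, and the second term is taken here to be the mean fraction of each node's edges that leave the module. That reading is an assumption, chosen because it reproduces the stated endpoints of the measure: +1 for a fully connected, stand-alone module and -1 for a set of nodes with no internal edges.

```python
def modular_coherence(edges, module):
    """Coherence of `module` (a collection of node labels) in an undirected
    graph given as a list of (u, v) pairs.

    ASSUMPTION: score = internal edge density minus the mean fraction of
    each node's edges that cross the module boundary.
    """
    module = set(module)
    n = len(module)
    if n == 0:
        raise ValueError("module must contain at least one node")

    internal = {node: 0 for node in module}  # k_ji for each node j
    external = {node: 0 for node in module}  # k_jo for each node j
    k_i = 0                                  # edges with both ends inside
    for u, v in edges:
        u_in, v_in = u in module, v in module
        if u_in and v_in:
            k_i += 1
            internal[u] += 1
            internal[v] += 1
        elif u_in:
            external[u] += 1
        elif v_in:
            external[v] += 1

    # First term: fraction of possible internal links that exist.
    density = 2.0 * k_i / (n * (n - 1)) if n > 1 else 0.0

    # Second term: nodes with no edges at all contribute zero (a convention).
    crossing = 0.0
    for node in module:
        total = internal[node] + external[node]
        if total:
            crossing += external[node] / total
    return density - crossing / n


triangle = [(0, 1), (1, 2), (0, 2)]
with_tail = triangle + [(2, 3)]
print(modular_coherence(triangle, {0, 1, 2}))   # stand-alone clique
print(modular_coherence(with_tail, {0, 1, 2}))  # clique with one external edge
print(modular_coherence(with_tail, {0, 3}))     # no internal edges
```

Averaging this score over all modules obtained at a given threshold of the cluster tree, and repeating for every threshold, would give one point per level of a coherence profile.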
The yeast network has approximately equal coherence over most thresholds, indicating that the network has a strongly hierarchically modular organization. In contrast to the yeast network, the random network (Fig. 7) shows negative coherence over most of its range. Although the clustering algorithm is still identifying modules, as expected, they have no coherence, have more external than internal edges, and do not fit the definition of a module given earlier. At the higher threshold levels, coherence rises slightly above zero. Inspection of the clustered network reveals that the modules detected at the extreme of the graph are an artefact of the unequal lengths of the branches of the

cluster tree. At the extremes of the tree, there tends to be one large cluster and a number of very small (one or two node) clusters. Within the large cluster most edges will lie between nodes in the same cluster. The number of tiny clusters, most of whose edges connect to nodes external to the cluster, is too small to drive the mean coherence below zero. The random network, therefore, shows no evidence of hierarchical modularity, or, indeed, of significant modularity at any level.

Fig. 6. Coherence profile (coherence plotted against threshold) for the core S. cerevisiae protein-protein interaction network. The data is the mean of 100 runs of the algorithm on the same network. Dashed lines represent ±1 standard deviation.

Fig. 7. Coherence profile (coherence plotted against threshold) for random networks with approximately the same number of nodes and edges as the yeast network. The data is the mean of 100 runs of the algorithm for each of five randomly generated networks. Dashed lines represent ±1 standard deviation.

The fact that the yeast network displays hierarchical modularity, while the random network does not, provides a benchmark against which to assess the biological plausibility of the network evolution algorithms discussed in the introduction. The coherence profiles for the preferential attachment and gene duplication networks are shown in Figs. 8 and 9, respectively. The preferential attachment algorithm has been shown to produce scale-free networks (Albert and Barabasi, 2000). Fig. 8 shows, however, that these networks exhibit no sign of modularity at any level of the hierarchy. In contrast, the networks generated by the gene duplication algorithm have a coherence profile very similar to that of the yeast protein-protein interaction network, with significant modular coherence present at almost every level of the hierarchy. It appears that gene duplication is more likely to produce a

hierarchically modular network than is preferential attachment.

Fig. 8. Coherence profile (coherence plotted against threshold) for the preferential attachment algorithm. The data is the mean of 100 runs of the algorithm for each of five randomly generated networks. Dashed lines represent ±1 standard deviation.

Fig. 9. Coherence profile (coherence plotted against threshold) for the gene duplication algorithm. The data is the mean of 100 runs of the hierarchical modularity detection algorithm for each of five randomly generated networks. Dashed lines represent ±1 standard deviation.

Preferential attachment is a feasible mechanism for the evolution of some networks, such as social networks, in which an already popular individual is likely to be sought out by new members of the social group. In a biological context, however, preferential attachment appears less plausible. There is no particular reason why a newly evolved protein should bind more readily to a protein which already binds to several other partners.

Gene duplication, however, has been shown to be important in evolution. A newly duplicated gene already has a functional output, which is usually a protein. This protein, however, is free of the selection pressure which acts upon its parent, since the parent still exists and fills its original function. One copy of the gene is, thus, free to mutate and change its function. It is known that the yeast genome has undergone several episodes of complete duplication in the course of its evolutionary history (Wagner, 2001). Gene duplication would, therefore, appear to be a plausible mechanism by which biological networks may have evolved. Gene duplication has previously been shown to produce scale-free networks in silico (Pastor-Satorras et al., 2002), and we show here that it also produces hierarchically modular networks, very similar in profile to the yeast network.

There are other aspects of the topology of the yeast protein-protein interaction network which appear to

be consistent with an origin by gene duplication. In order to produce a network as sparsely connected as a typical intracellular interaction network, a large proportion of the duplicated edges (in our study 0.75) must be deleted, while relatively few are added. The algorithm, therefore, tends to produce highly fragmented networks containing a large number of individual nodes unconnected to any other nodes. This pattern of connectivity is evident in yeast. The core yeast dataset contains only 1788 of the 6223 proteins encoded by the yeast genome. The number of false negatives in the dataset (genuine interactions which have not yet been detected) is currently unknown. However, several genome-wide scans for protein-protein interactions have been performed (Schwikowski et al., 2000; Legrain and Selig, 2000; Deane et al., 2002), and it is unlikely that the majority of interactions have been missed. The number of genuinely isolated proteins in the yeast network appears to be consistent with the gene duplication algorithm.

The major problem with the gene duplication algorithm used here is the difficulty of evolving a network with a single connected component of a size comparable with that of the largest connected component of the yeast network. Increasing the total number of nodes generated increases the number of isolated nodes much more rapidly than the size of the largest connected component. It can be seen from Table 1 that the gene duplication networks were, in general, smaller and more highly connected than the other networks. These results are consistent with the hypothesis that gene duplication, while important, is not the only factor in yeast evolution; a suggestion with which most biologists would heartily agree.

References

Albert, R., Barabasi, A.-L., Topology of evolving networks: local events and universality. Phys. Rev. Lett. 
85, Albert, R., Jeong, H., Barabasi, A.-L., Internet: diameter of the world-wide web. Nature 401, Amaral, L.A.N., Scala, A., Barthelemy, M., Stanley, H.A., Classes of small-world networks. Proc. Natl. Acad. Sci. U.S.A. 97, Ancel, L.W., Fontana, W., Evolutionary lock-in and the origin of modularity in RNA structure. In: Callabaut, W., Rasskin-Gutman, D. (Eds.), Modularity. Understanding the development and evolution of complex natural systems. Cambridge, MA, MIT Press. Bhalla, U.S., Iyengar, R., Emergent properties of networks of biological signaling pathways. Science 283, Calabretta, R., Nolfi, S., Parisi, D., Wagner, G.P., A case study of the evolution of modularity: Towards a bridge between evolutionary biology, artificial life, neuro- and cognitive science. In: Adami, C., Belew, R., Kitano, H., Taylor, C. (Eds.), Proceedings of the Sixth International Conference on Artificial Life. Cambridge, MA, MIT Press, pp Csete, M.E., Doyle, J.C., Reverse engineering of biological complexity. Science 295, Deane, C.M., Salwinski, L., Xenarios, I., Eisenberg, D., Protein interactions. Mol. Cell. Proteomics 1, Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95, Fields, S., Song, O., A novel genetic system to detect protein protein interactions. Nature 340, Flake, G.W., Lawrence, S., Giles, C.L., Coetzee, F.M., Self-organization and identification of web communities. IEEE Comput. 35, Freeman, L.C., A set of measures of centrality based on betweenness. Sociometry 40 (1), Girvan, M., Newman, M.E.J., Community structure in social and biological networks. Proc. Natl. Acad. Sci. U.S.A. 99, Holme, P., Kim, B.J., Yoon, C.N., Han, S.K., Attack vulnerability of complex networks. Physical Review E 65, Huberman, B.A., Adamic, L.A., Internet: growth dynamics of the world-wide web. Nature 401, 131. 
Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabasi, A.- L., The large-scale organization of metabolic networks. Nature 407, Kleinberg, J., Lawrence, S., The structure of the web. Science 294, Legrain, P., Selig, L., Genome-wide protein interaction maps using two-hybrid systems. FEBS Lett. 480, Lynch, M., Gene duplication and evolution. Science 297, Newman, M.E., The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. U.S.A. 98, Pastor-Satorras, R., Smith, E., Sole, R.V., Evolving protein interaction networks through gene duplication. Santa Fe Institute Working Paper Ravasz, E., Somera, A.L., Oltvai, Z.N., Barabasi, A.-L., Hierarchical organization of modularity in metabolic networks. Science 297, Ravasz, E., Barabasi, A.-L., Hierarchical organization in complex networks. Physical Review E 67, Rives, A.W., Galitski, T., Modular organization of cellular networks. Proc. Natl. Acad. Sci. U.S.A. 100 (3), Schuster, S., Pfeiffer, T., Moldenhauer, F., Koch, I., Dandekar, T., Exploring the pathway structure of metabolism: decomposition into subnetworks and application to Mycoplasma pneumoniae. Bioinformatics 18,

Schwikowski, B., Uetz, P., Fields, S., A network of interacting proteins in yeast. Nat. Biotechnol. 18,
Snel, B., Bork, P., Huynen, M.A., The identification of functional modules from the genomic association of genes. Proc. Natl. Acad. Sci. U.S.A. 99,
Sole, R., Pastor-Satorras, R., Complex networks in genomics and proteomics. Santa Fe Institute Working Paper
Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S., Gilles, E.D., Metabolic network structure determines key aspects of functionality and regulation. Nature 420,
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S., Rothberg, J.M., A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403,
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., Bork, P., Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417,
Wagner, A., The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol. Biol. Evol. 18,
Watts, D.J., Small-Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press, Princeton, NJ.
Williams, R.J., Martinez, N.D., Simple rules yield complex food webs. Nature 409,
Batagelj, V., Mrvar, A., Pajek program for large network analysis. Connections 21,