How To Understand A Protein Network


Graph-theoretical approaches for studying biological networks
by Tijana Milenković
Advancement Committee: Prof. Wayne Hayes; Prof. Lan Huang; Prof. Eric Mjolsness; Prof. Zoran Nenadić; Prof. Nataša Pržulj, Chair; Prof. Xiaohui Xie
Department of Computer Science, University of California, Irvine
Copyright © 2008 by Tijana Milenković

Contents
1 Introduction
Motivation
Types of biological networks
Methods for proteome detection
Yeast two-hybrid assay
Mass spectrometry of purified complexes
Other methods
Assessment of interaction data quality produced by different methods
Completeness and availability of data sets
Major challenges
Network properties
Global network properties
Local network properties
Network motifs
Graphlet-based similarity measures of local network structure
Network models
Survey of network models
An optimized null model for PPI networks

3.2.1 Scale-freeness of PPI networks
Geometricity of PPI networks
Stickiness of PPI networks
Protein function prediction from PPI network topology
Motivation
Methods for protein function prediction
Direct methods
Cluster-based methods
Biological networks in disease
Disease gene identification
Disease networks
Characterizing drug-drug target relationships
Druggable proteins
DrugBank
Drug-target network
Network comparison
Types of network comparison methods
Algorithms for network alignment
Pairwise alignment of PPI networks
Multiple PPI network alignment
Software tools for network analyses and modeling
Bibliography

Chapter 1 Introduction

1.1 Motivation

Recent technological advances in experimental biology have yielded large amounts of biological network data. Many other real-world phenomena have also been described in terms of large networks (also called graphs), such as various types of social (1) and technological (2) networks. Networks are invaluable models for better understanding biological systems. To understand living cells, one must study them as systems rather than as collections of individual parts. Whether its constituents are molecules, cells, or living organisms, a network provides a framework to model the complex events that emerge from interactions among these parts.

Biological networks come in a variety of forms. Nodes in biological networks represent biomolecules such as genes, proteins or metabolites, and edges connecting these nodes indicate functional, physical or chemical interactions between the corresponding biomolecules. Understanding these complex biological systems has become an important problem that has led to intensive research in network analysis, modeling, and the identification and prediction of function and disease genes. The hope is that such systems-level approaches to analyzing and modeling complex biological systems will provide insights into the inner workings of the cell, biological function, and disease.

1.2 Types of biological networks

Biological networks include transcriptional regulatory networks (3), metabolic networks (4), signal transduction networks, protein structure networks (5), networks summarizing neuronal connectivities (6), and protein-protein interaction networks (7). Studying biological networks at these various granularities could provide valuable insight into the inner workings of cells, and might lead to important discoveries about complex diseases.

In transcriptional regulation networks, nodes represent genes and edges are directed from a gene that encodes a transcription factor protein to a gene transcriptionally regulated by that transcription factor (see Figure 1.1). Thus, the network structure is an abstraction of the system's biochemical dynamics that is responsible for regulating the expression of genes in cells. The two best characterized transcriptional regulation networks are those of a eukaryote, the yeast Saccharomyces cerevisiae, and a bacterium, Escherichia coli (8).

Figure 1.1: Two examples of biological networks. The figure is taken from (8).

One of the most important life processes is the metabolism of an organism, the basic chemical system that generates essential components such as amino acids, sugars and lipids, and the energy required to synthesize them and to use them in creating proteins and cellular structures. A metabolic network represents this system of connected chemical reactions, i.e., the complete set of metabolic and physical processes that determine the physiological and biochemical properties of a cell. Metabolic network reconstruction breaks down metabolic pathways into their respective reactions and enzymes.

Thus, in these networks, small-molecule substrates can be envisioned as nodes and the enzyme-catalyzed reactions that transform one metabolite into another as links. With the sequencing of complete genomes, it is now possible to reconstruct the network of biochemical reactions in many organisms, from bacteria to human. These networks are available in several databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (9).

Metabolic networks are powerful tools for studying and modeling metabolism. However, a precise graph-theoretic description of real-world metabolic networks (Figure 1.2 A) still needs to be established. For example, in the most abstract approach, all interacting metabolites are considered equally and the edges between nodes represent reactions that convert one substrate into another (Figure 1.2 B). However, for many biological applications, it is useful to ignore co-factors, which can result in a completely different type of mapping that connects only the main source metabolites to the main products (Figure 1.2 C).

Figure 1.2: Different graph-theoretic representations of metabolic pathways: (A) real-world metabolic pathway; (B) representation when all interacting metabolites are considered equally; and (C) representation when only the main source metabolites and the main products are considered. The figure is taken from (10).

Cell signaling is part of a complex system of communication that governs basic cellular activities and coordinates cell actions. The ability of cells to perceive and correctly respond to their environment is the basis of development, tissue repair, and immunity. Errors in cellular information processing are responsible for diseases such as cancer, autoimmunity, and diabetes. Thus, if we want to understand cellular behavior and its responses to external signals, we have to understand the signal transduction networks, i.e., the pathways through which these signals are mediated into and within the cell.

Whereas some cell-to-cell communication requires direct cell-to-cell contact, many cell signals are carried by molecules that are released by one cell and move to make contact with another cell. Cells receive information from their environment through a class of proteins known as receptors. Molecules that activate (or, in some cases, inhibit) receptors are called receptor ligands. While many receptors are cell surface proteins, some are found inside cells. In some cases, receptor activation caused by ligand binding to a receptor is directly coupled to the cell's response to the ligand. However, for many cell surface receptors, ligand-receptor interactions are not directly linked to the cell's response. The activated receptor must first interact with other proteins inside the cell before the ultimate physiological effect of the ligand on the cell's behavior is produced. Often, the behavior of a chain of several interacting cell proteins is altered following receptor activation. The entire set of cell changes induced by receptor activation is called a signal transduction mechanism or pathway.

Proteins are known to function by adopting a unique three-dimensional structure, determined by the sequence of amino acids in the polypeptide chain.
To learn more about the rules of protein structure, stability and folding, the view of protein structural space has in the past few years shifted towards a network (graph) representation of protein structure. In such a framework, protein structures are modeled as residue interaction graphs (RIGs), in which nodes represent amino acid residues and edges describe pairwise contacts between residues. A contact between two residues is defined if the distance between any pair of their heavy atoms is within a specified distance cut-off. Most studies use distance cut-offs that lie in the range [4.0, 5.0] Å (11; 12; 13). An example of a RIG and the corresponding protein is presented in Figure 1.3.

Figure 1.3: An illustration of a protein and its corresponding RIG: (A) a protein for which the RIG was formed; (B) the protein and its RIG; (C) RIG alone.

In neuroscience, a neural network describes a population of physically interconnected neurons. Communication between neurons often involves an electrochemical process. The interface through which a neuron interacts with surrounding neurons usually consists of several dendrites (input connections), which are connected via synapses to other neurons, and one axon (output connection). Networks summarizing neuronal connectivities model synaptic connections between neurons, as illustrated in Figure 1.1.

Finally, in protein-protein interaction (PPI) networks, nodes correspond to proteins and undirected edges represent physical interactions amongst them. An example of a PPI network consisting of 11,000 interactions amongst 2,401 yeast proteins (14) is presented in Figure 1.4. In this survey, we mainly focus on PPI networks for the following reason. It is now possible to list the genes and encoded proteins for an increasing number of organisms. However, it is the proteins that execute the genetic programme. Proteins that are actually produced by a cell at any given time constitute its proteome. The proteins form large interaction networks, in which they regulate and support each other, and to fully understand the cellular machinery, simply listing the proteins is not enough - all the interactions between them, i.e., the complete interactome, need to be known as well.
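Returning to the residue interaction graphs defined earlier in this section, the contact rule (an edge whenever any pair of heavy atoms from two residues lies within the distance cut-off) can be illustrated with a minimal sketch. The input layout (a dictionary from residue identifier to a list of heavy-atom coordinates in Å), the function name, and the toy coordinates are assumptions made for illustration; they are not taken from the cited studies.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def residue_interaction_graph(residues, cutoff=5.0):
    """Return the RIG edge set for a protein.

    residues: hypothetical layout, dict mapping a residue id to a list of
              (x, y, z) heavy-atom coordinates in angstroms.
    cutoff:   distance cut-off in angstroms (the text cites 4.0 to 5.0).
    """
    edges = set()
    for a, b in combinations(sorted(residues), 2):
        # A contact exists if ANY heavy-atom pair is within the cut-off.
        if any(dist(p, q) <= cutoff for p in residues[a] for q in residues[b]):
            edges.add((a, b))
    return edges


# Toy coordinates, invented for illustration only.
toy = {
    1: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    2: [(4.0, 0.0, 0.0)],
    3: [(12.0, 0.0, 0.0)],
}
print(residue_interaction_graph(toy, cutoff=5.0))
```

The all-pairs loop is quadratic in the number of residues, which is adequate for a sketch; a spatial index would be the usual choice for large structures.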

PPI networks represent an opportunity as well as a challenge. Analyzing these networks may provide useful clues about the function of individual proteins, protein complexes, and larger cellular machines. However, the volume and noisiness of PPI data make many algorithms for their analysis intractable (15). Additionally, the graph representation of PPI data, with nodes and edges corresponding to proteins and protein interactions, respectively, does not address some of the major properties of protein interaction data. It does not deal with the noisiness of the data, i.e., the large number of false positives and negatives. Moreover, all spatial and temporal information is lost, as well as the information about the conditions of biochemical experiments, confidence of interactions, number of experiments confirming the interactions, etc. However, no other model for representing PPI data has been proposed thus far. Despite all these drawbacks, understanding these complex phenomena is crucial and can lead to significant discoveries about complex biological mechanisms and diseases.

Figure 1.4: The PPI network with 11,000 interactions amongst 2,401 yeast proteins. The figure is taken from (15).

1.3 Methods for proteome detection

Traditionally, protein interactions have been studied individually by various small-scale biochemical techniques. However, new proteins are being discovered at a high rate, and the need for large-scale (high-throughput) interaction detection methods has arisen. In general, discoveries in small-scale experiments are assumed to be of better quality than those from high-throughput experiments, i.e., they are expected to result in higher-confidence interaction data sets. However, one could argue that high-throughput experiments have the advantage of being standardized, whereas small-scale experiments are performed differently each time. Additionally, in small-scale experiments, the focus is on the subsets of the proteome that are considered interesting by particular researchers. High-throughput experiments, on the other hand, give an unbiased view of the entire proteome.

There exist a variety of methods for obtaining rich protein interaction network data. The two methods most commonly used to produce large-scale data sets are yeast two-hybrid (Y2H) screening (16; 17; 18) and protein complex purification methods using mass spectrometry (e.g., TAP (19; 20) and HMS-PCI (21)). Other relevant methods include correlated messenger RNA (mRNA) expression profiles (22), genetic interactions (23), and in silico (computed) interaction prediction methods derived from gene context analysis (24; 25). A survey of biochemical methods used to identify PPIs can be found in (26).

Although all of the above-mentioned techniques can be used for protein interaction prediction, their goals are different. Yeast two-hybrid and mass spectrometry techniques aim to detect physical binding between proteins, whereas genetic interactions, mRNA coexpression and in silico methods seek to predict functional associations, for example, between a transcriptional regulator and the pathway it controls.
In many cases, however, such functional associations do take the form of physical binding.

Yeast two-hybrid assay

The yeast two-hybrid (Y2H) system is a method for testing pairwise protein-protein interactions that has been used for nearly two decades (27). However, the system has only recently been used for high-throughput discovery of PPIs in yeast (16; 17; 18). Additionally, several Y2H studies discovering novel interactions in fruitfly (28), worm (29), and human (30; 31) have recently been introduced.

The method works as follows. Pairs of proteins to be tested for interaction are expressed as fusion proteins ("hybrids") in yeast: one protein is fused to a DNA-binding domain and the other to a transcriptional activator domain. Any interaction between them is detected by the formation of a functional transcription factor. Of course, forcing two proteins together gives rise to a high false-positive rate, in the sense that some pairs that truly bind in the assay never do so inside cells, because of different localization or because they are never simultaneously expressed. False negatives may occur because some components that are crucial for an interaction might be lacking, due to localizing the hybrid proteins in the nucleus and due to expressing non-yeast proteins in yeast. Whereas Y2H is an in vivo technique capable of detecting transient and unstable interactions, only two proteins are tested at a time. Additionally, it takes place in the nucleus, so many proteins are not in their native compartment.

Mass spectrometry of purified complexes

With methods of this type, individual proteins are tagged and used as hooks to biochemically purify whole protein complexes. These are then separated and their components identified by mass spectrometry (MS). Two protocols exist: tandem affinity purification (TAP) (19; 20) and high-throughput mass-spectrometric protein complex identification (HMS-PCI) (21). Here, we mainly focus on TAP.

The TAP technology has allowed the dissection of hundreds of protein complexes from yeast (20; 32; 33). Although no comprehensive TAP purification strategy towards animal or plant PPI networks has been undertaken, improvements of the TAP tag for purification of complexes from these organisms and the development of highly sensitive and accurate mass spectrometers will allow such analysis in the near future (34).

When generating networks from TAP experiments, some authors assume edges between the tagged protein, i.e., the bait, and any other protein that is copurified with it, i.e., a prey, as well as between all pairs of preys. This matrix model therefore joins all proteins within the same complex by edges, thus forming a clique, although co-purification does not necessarily mean direct physical binding between them. The alternative is the spoke model, which assumes edges only between the bait and each of the preys, without connecting any of the preys with each other. Whereas the spoke model introduces fewer false positives than the matrix model, it can miss true interactions among preys. The matrix model, on the other hand, captures every true interaction within a purified complex, at the cost of introducing numerous false positives. It has been shown that the spoke model increases the accuracy of interaction data sets compared to the matrix model, both for TAP and HMS-PCI (14).

With TAP and similar methods, several members of a complex can be tagged, giving an internal check for consistency. However, these methods might miss some complexes that are not present under the given conditions. Additionally, tagging may disturb complex formation, and loosely associated components may be washed off during purification. Despite these drawbacks, it has been shown that computational discovery of protein complexes from TAP-derived networks is more accurate than from Y2H-derived networks (33).
This was done by comparing predicted complexes to the known ones present in the MIPS database (23). However, this result is expected, since networks derived by TAP experiments explicitly include information about the protein complexes through the additional indirect edges.
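The difference between the spoke and matrix models described above can be made concrete with a minimal sketch. The function names and the toy purification (bait A with preys B, C and D) are illustrative assumptions, not data from the cited experiments.

```python
from itertools import combinations


def spoke_edges(bait, preys):
    """Spoke model: edges only between the bait and each prey."""
    return {frozenset((bait, p)) for p in preys if p != bait}


def matrix_edges(bait, preys):
    """Matrix model: edges between all pairs of co-purified proteins (a clique)."""
    members = {bait, *preys}
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}


# Hypothetical purification: bait "A" pulls down preys "B", "C" and "D".
bait, preys = "A", ["B", "C", "D"]
spoke = spoke_edges(bait, preys)
matrix = matrix_edges(bait, preys)
print(len(spoke), len(matrix))  # 3 spoke edges versus 6 matrix edges
```

For a purification with k preys, the spoke model yields k edges while the matrix model yields k(k+1)/2, which is why the matrix model's false-positive burden grows quickly with complex size.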

Other methods

Correlated mRNA expression

The first step in protein synthesis, the process by which a gene produces protein, is transcription, by which genes are transcribed to messenger RNA (mRNA). The second step is splicing, where regions that do not code for proteins are removed from the sequence. The final step, by which a protein is synthesized from mRNA, is known as translation. The mRNA abundance of a gene is the amount of mRNA transcribed, and it correlates with gene expression, i.e., with the rate at which a gene produces the corresponding protein. As almost all biological processes are carried out by proteins, a direct measurement of protein level in a cell might seem more appropriate. However, this is harder to measure than mRNA abundance.

With the correlated mRNA expression method for PPI detection, mRNA levels are systematically measured under a variety of different cellular conditions, and genes are grouped if they show a similar transcriptional response to these conditions. These groups are enriched in genes encoding physically interacting proteins (35). This is an in vivo technique with a broader coverage of cellular conditions than other methods. However, it is a relatively inaccurate predictor of direct physical interaction, and it is very sensitive to parameter choices and clustering methods during analysis.

Genetic interactions (synthetic lethality)

Two nonessential genes that cause lethality when mutated at the same time form a synthetic lethal interaction. Such genes are often functionally associated and their encoded proteins may also interact physically. This is an in vivo technique capable of producing unbiased genome-wide screens.
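As an illustration of the correlated mRNA expression method described earlier in this subsection, the following minimal sketch pairs genes whose expression profiles across conditions are strongly correlated. It substitutes a simple Pearson-correlation threshold for the clustering step mentioned in the text; the threshold value, the function names, and the toy profiles are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold.

    profiles: dict mapping gene name to mRNA levels measured under the same
              ordered list of conditions (hypothetical layout).
    """
    return {
        (g, h)
        for g, h in combinations(sorted(profiles), 2)
        if pearson(profiles[g], profiles[h]) >= threshold
    }


# Invented toy profiles over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
print(coexpressed_pairs(profiles))
```

The sensitivity to parameter choices noted in the text is visible here: which pairs are reported depends directly on the chosen threshold.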

In silico predictions through genome analysis

Whole genomes can be screened for types of interaction evidence. For example, interacting proteins have a tendency to be either present or absent together from fully sequenced genomes. This can be considered as an indication of a physical interaction (14). Although in silico techniques are fast and inexpensive, and coverage expands as more genomes are sequenced, they require a framework for assigning orthology between proteins, and fail where orthology relationships are not clear.

Assessment of interaction data quality produced by different methods

Comparing interaction data produced by these different methods is difficult, because interactions are often derived under different conditions or come in different formats. It is not surprising that data sets produced by different methods are often complementary (34; 14). Moreover, even data sets obtained by the same technique can complement each other to some extent (14). The differences in the methods described above have caused some confusion within the scientific community. The term protein-protein interaction carries two meanings: direct physical binding, or membership of the same multiprotein complex. The latter usage is very common; for example, both major efforts to map protein complexes in yeast (32; 33) describe interactions between co-complexed proteins and identify PPI networks consisting of these interactions (36).

Von Mering et al. (14) have performed a comprehensive comparative assessment of large-scale yeast protein interaction data sets. As the trusted reference for evaluating the interaction quality, they relied on manually curated catalogues of known protein complexes available in the Munich Information Center for Protein Sequences (MIPS) (23) and the Yeast Proteome Database (YPD) (37). They found that out of about 80,000 interactions available for yeast at that time, only about 2,400 were supported by more

than one method. When assessing the quality of interaction data, both coverage and accuracy were considered together. A data set of high coverage is not very useful if its accuracy is low, and vice versa. However, none of the methods covered more than 60% of the proteins in the yeast genome. Additionally, the authors pointed out a major bias in all data sets towards proteins of high abundance.

The authors based their assessments on various criteria when comparing interaction data against the trusted reference. For example, an indirect measure of quality was the degree to which interacting proteins were annotated with the same functional category, or were localized in the same subcellular compartments. It was expected that proteins of broadly related functions preferentially interacted with each other (14). (This observation has recently been applied to protein function prediction - see Chapter 4 for details.) Additionally, the authors showed that protein localization data provided an independent measure of quality for different data sets, because proteins known to interact were usually localized similarly.

Based on their criteria, the authors compared all interactions present in more than one data set to the trusted reference data set, and labeled these interactions as high-confidence interactions. Not only was the overlap of high-throughput data around 20 times larger than expected by chance, but it also consisted mainly of interactions in which both partners had the same functional category and subcellular localization. Both of these observations suggested that the overlap consisted largely of true positives. The study resulted in the high-confidence yeast PPI network consisting of 2,455 interactions amongst 988 proteins. Based on their conclusions, more than half of the total of 80,000 interactions were false positives. Moreover, the fraction of false positives in Y2H data sets was also predicted to be about 50%.
Two important TAP data sets were introduced only recently (32; 33) and were not included in the above mentioned study. However, the reliabilities of interactions originating from these two data sets were later evaluated by Collins et al. (38) (see below for details). Gavin et al. (32) analyzed all ORFs of yeast Saccharomyces cerevisiae and reported the first genome-wide screen for complexes in an organism, budding yeast,

using TAP and mass spectrometry. The authors observed that 64% of known complexes as defined in MIPS (23) were retrieved several times, resulting in a high coverage of known components. In total, they partitioned the ensemble of cellular proteins into 491 complexes, 257 of which were novel. This resulted in an interaction data set consisting of 19,435 interactions amongst 2,231 proteins.

Krogan et al. (33) identified high-quality protein interactions in yeast by combining results from two purifications and applying machine learning algorithms to assign confidence to each interaction. The study resulted in 7,123 high-confidence interactions amongst 2,708 proteins. Next, the authors used the MIPS complex catalogue (23) as a reference database and the Markov clustering algorithm to organize these interactions into 547 protein complexes, about half of them being absent from the MIPS database.

Recently, Collins et al. (38) combined these two important TAP data sets (32; 33). The practical utility of such high-throughput interaction sets is substantially decreased by the presence of false positives. Collins et al. (38) created a novel probabilistic metric that took advantage of the high density of these data, including both the presence and absence of individual associations, to provide a measure of the relative confidence of each potential protein-protein interaction. This analysis largely overcame the noise inherent in the high-throughput experiments, resulting in a high-confidence network consisting of 9,074 interactions amongst 1,622 proteins, comparable in quality to small-scale experiments. Additionally, the authors organized proteins into coherent multisubunit complexes using hierarchical clustering. Finally, Pu et al. (39) explored the PPI network reported by Collins et al. (38) to derive protein complexes.
The authors showed that the protein complexes they detected from this network were more reliable than those derived from other data sets, thus demonstrating that their descriptions of protein complexes were more accurate and meaningful than those previously published.
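The overlap criterion used by von Mering et al. (14) earlier in this section, under which an interaction counts as high-confidence when more than one data set supports it, can be sketched as follows. The data set names and edges are invented for illustration, and the function is an assumed simplification: the original study additionally benchmarked the overlap against a trusted reference.

```python
from collections import Counter


def supported_by_multiple(datasets, min_support=2):
    """Return interactions reported by at least `min_support` data sets.

    datasets: dict mapping a method name to an iterable of (protein, protein)
              pairs. Edges are undirected, so (A, B) and (B, A) are the same.
    """
    support = Counter()
    for edges in datasets.values():
        # Deduplicate within a data set so each method votes at most once.
        for edge in {frozenset(e) for e in edges}:
            support[edge] += 1
    return {edge for edge, k in support.items() if k >= min_support}


# Hypothetical toy data sets.
datasets = {
    "y2h": [("A", "B"), ("B", "C")],
    "tap": [("B", "A"), ("C", "D")],
    "coexpression": [("C", "D"), ("E", "F")],
}
high_confidence = supported_by_multiple(datasets)
print(sorted(sorted(e) for e in high_confidence))
```

Representing each edge as a `frozenset` is a design choice that makes the comparison independent of the order in which a pair is reported.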

Completeness and availability of data sets

Regrettably, the various experimental approaches for identification of protein interactions are error prone, leading to noisy results when applied on a proteomic scale. Given that the experimental detection of protein interactions is not straightforward, even precise and time-consuming experiments often produce low coverage. As a consequence, despite the exponential increase in the amount of protein interaction information, until now only the yeast Saccharomyces cerevisiae has been investigated systematically in experiments designed to detect all of the possible interactions. Due to its ease of genetic manipulation, yeast has been one of the most extensively studied organisms. However, even for yeast, two comparable, comprehensive experiments, performed in parallel by two different groups using the same (TAP) approach (32; 33), ended up with fewer than 30% common interactions (40). In addition to yeast, three more species, fruitfly Drosophila melanogaster, worm Caenorhabditis elegans, and human Homo sapiens, have received considerable experimental attention, such that interactomes exceeding 5,000 links can be assembled from public databases.

Besides data sets obtained by the biochemical experimental techniques presented above, several efforts have been made to build large human PPI networks by assembling data sets published in previous studies or curated from the literature. One such approach resulted in the Human Protein Reference Database (HPRD) (41), containing more than 38,000 interactions. Additionally, computational methods have been used to predict interactions. For example, a human PPI network consisting of about 343,000 unique interactions amongst about 8,500 human proteins was created by pooling human interaction data from several databases and also by transferring interactions from model organisms. Out of these, about 62,000 interactions were of high confidence (42).
Data sets resulting from small- and large-scale screens are now publicly available

in several databases, including the Saccharomyces Genome Database (SGD) (43), the Munich Information Center for Protein Sequences (MIPS) (23), the Database of Interacting Proteins (DIP) (44), the Molecular Interactions Database (MINT) (45), the Online Predicted Human Interaction Database (OPHID) (46), the Human Protein Reference Database (HPRD) (41), and the Biological General Repository for Interaction Datasets (BioGRID) (47). For a more detailed survey of these databases, see (15). The overlap between the databases is very small (34), making it difficult to obtain confidence in the interactions. On the other hand, it could be argued that each such data set contains a different, slightly overlapping sample of the entire network and that combining them would provide a better estimate of complete protein interaction networks. This idea may be supported by the fact that estimated sizes of PPI networks exceed the number of interactions currently stored in any of the databases.

To estimate the completeness of the interactomes of different species, a comparison of the human interactome available in HPRD (41) with the available interaction data sets for yeast, fly, and worm has been performed (48). Since physical interactions of one organism are expected to be conserved in other related organisms, and since the analyzed human PPI data set is large, the authors investigated the extent to which the human PPIs overlap with those reported in yeast, fly, and worm. To quantify overlaps between species, they first identified orthologs by using an all-versus-all BLAST search followed by clustering into orthologous groups, thus identifying 1,131 yeast, 3,347 fly, and 1,079 worm genes that had high-confidence human orthologs. These orthologous pairs were used to systematically examine overlap amongst a total of about 70,000 interactions (25,464 human, 16,069 yeast, 5,625 worm, and 24,587 fly pairwise interactions).

As illustrated in Figure 1.5, a lack of overlap was observed, which was somewhat surprising given the large number of interactions in each of the analyzed data sets. Only 63 out of a maximum of 1,405 (i.e., 4.5%) PPIs were detected in the human-fly comparison, 105 out of a maximum of 266 (i.e., 39.5%) PPIs were detected in the human-worm comparison, and 42 out of a maximum of 288 (i.e., 14.6%) PPIs overlapped in all three species.

Figure 1.5: Overlap of PPI data sets (48).

However, comparisons of interactomes of different species depend heavily on network completeness. Although maps of both the yeast and human protein interaction networks are well under way, their completion poses many problems. The anticipated scale of the human network could require multiple testing of all possible pairs of around 20,000-25,000 human proteins, i.e., roughly 200 million to 300 million pairs (36). The scale of this effort raises many questions. How do we even measure completion? The network is, after all, unknown. How close are we to completing the networks? How do we assess errors in the maps? Would maps obtained using only a single technique suffice? It has been shown that although large numbers of interactions have been mapped for yeast and human, false-positive rates are so high that only about half of the expected yeast network has been defined to date, and considerably less for the human one (36). Additionally, it has been argued that the sizes of the complete yeast and human protein

interaction networks will be larger than most early estimates. Whereas previous studies have estimated 10,000-30,000 total yeast interactions, new estimates are about twice as large: about 29,000-58,000 interactions amongst about 5,800 yeast genes. Similarly, it has been predicted that the human interactome will have about 260,000 interactions between 20,000-25,000 human proteins (36). Current coverage is thus low. Moreover, errors are common in individual data sets. For example, the full Y2H data set published by Ito et al. (16) has a measured false-positive rate of about 80% (36). Thus, interactomes will only be fully mapped through integration of repeated analyses from many groups.

1.5 Major challenges

Major challenges when studying biological networks include network analyses, comparisons, modeling, and alignments aimed at discovering a relationship between network topology on one side and biological function, disease, and evolution on the other. Analogous to genetic sequence comparison, comparison of large cellular networks is expected to revolutionize biological understanding. However, comparing large networks is computationally infeasible due to NP-completeness of the underlying subgraph isomorphism problem (49). Note that even if subgraph isomorphism were feasible to compute, it would not find a practical application in biological network comparisons, since biological networks are extremely unlikely to be isomorphic. Thus, large network analyses and comparisons rely on heuristics, commonly called network parameters or properties. These properties are roughly categorized into global and local. The most widely used global properties are the degree distribution, the clustering coefficient, the clustering spectrum, the average diameter and the spectrum of shortest path lengths (50).
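Two of the global properties just listed, the degree distribution and the clustering coefficient, can be computed directly from an edge list, as in the following minimal sketch. The text does not spell out the definitions, so the sketch assumes the standard ones: the degree distribution counts how many nodes have each degree, and the clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected. The toy network is invented for illustration.

```python
from collections import Counter
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree k to the number of nodes with exactly k neighbors."""
    return Counter(len(nbrs) for nbrs in adj.values())


def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs of `node` that are linked to each other."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention assumed here for nodes of degree 0 or 1
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Network clustering coefficient as the mean over all nodes."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


# Toy network: a triangle A-B-C with a pendant node D attached to C.
adj = adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(dict(degree_distribution(adj)))
print(average_clustering(adj))
```

Both quantities are cheap to compute, which is what makes them usable as heuristics where exact network comparison is infeasible.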
Local properties include network motifs, which are small over-represented subgraphs (8; 51; 52), and two measures based on graphlets, which are small induced subgraphs of large networks: the relative graphlet frequency distance (RGF-distance), which compares the frequencies of the appearance of graphlets in two networks (53), and the graphlet degree distribution agreement (GDD-agreement), which is a graphlet-based generalization of the degree distribution (54). These properties have been used to compare biological networks against model networks and to find well-fitting network models for biological networks (53; 54; 55), as well as to suggest biological function of proteins in PPI networks (56). The properties of models have further been exploited to guide biological experiments and discover new biological features, as well as to propose efficient heuristic strategies in many domains, including time- and cost-optimal detection of the human interactome (57).

Modeling biological networks is of crucial importance for any computational study of these networks. Only a well-fitting network model that precisely reproduces the network structure and the laws through which the network has emerged can enable us to understand and replicate the biological processes and the underlying complex evolutionary mechanisms in the cell. Various network models have been proposed for real-world biological networks. Starting with Erdős-Rényi random graphs (58), network models have progressed through a series of versions designed to match certain properties of real-world networks. Examples include random graphs that match the degree distribution of the data (59), network growth models that produce networks with scale-free degree distributions (60) or small network diameters (61), geometric random graphs (62), and networks that reproduce some biological and topological properties of real biological networks (e.g., the stickiness model (55)). An open-source software tool called GraphCrunch (63) implements the latest research on biological network models and properties and compares real-world networks against a variety of network models with respect to a wide range of network properties.

Since proteins are essential macromolecules of life, understanding their function and their role in disease is of great importance.
However, the number of functionally unclassified proteins is large even for simple and well-studied organisms such as baker's yeast (64). Moreover, it is still unclear in what cellular states serious diseases, such as cancer, occur. Methods for determining protein function and identifying disease genes have shifted their focus from targeting specific proteins based solely on sequence homology to analyses of the entire proteome based on PPI networks (56). Since proteins interact to perform a certain function, PPI networks by definition reflect the interconnected nature of biological processes. Therefore, analyzing structural properties of PPI networks may provide useful clues about the biological function of individual proteins, protein complexes, pathways they participate in, and larger subcellular machines (56; 65). Additionally, recent studies have investigated associations between diseases and network topology in PPI networks and have shown that disease genes share common topological properties (66; 67). Moreover, PPI networks have recently been combined with networks describing the relationships between diseases and the disease genes causing them (68), as well as between drugs and their protein targets (69), thus giving new insights into pharmacology. Finding the relationship between PPI network topology and biological function and disease is one of the most challenging problems in the post-genomic era.

Structural properties of biological networks are also being extensively used in other biological applications. As more biological network data becomes available, comparative analyses of these networks across species are proving to be valuable, since such systems-biology comparisons may lead to exciting discoveries in evolutionary biology. For example, comparing networks of different organisms might provide deeper insights into the conservation through evolution of proteins, their function, the complexes and pathways they are involved in, and protein-protein interactions. The most common methods for such network comparisons are network alignments. An alignment is achieved by constructing a mapping between the nodes of the networks being compared and between the corresponding sets of edges. In this process, topologically and functionally similar regions of biological networks are discovered.
Whereas previous studies have focused on network alignment based solely on biological functional information such as sequence similarity of proteins in PPI networks, recent studies have been combining the functional information with topological network information to align biological networks (70).

These major challenges in studying biological networks are illustrated in Figure 1.6. In the following chapters of this survey, these problems are described in more detail and up-to-date methods addressing them are discussed.

Figure 1.6: The major challenges when studying biological (PPI) networks. The figure is partially taken from (71).

Chapter 2 Network properties

2.1 Global network properties

Global network properties give an overall view of a network. The most commonly used global network properties are the degree distribution, the average network diameter, the spectrum of shortest path lengths, the average clustering coefficient, and the clustering spectrum. The degree of a node is the number of edges incident to the node. The degree distribution, P(k), describes the probability that a node has degree k. The smallest number of links that have to be traversed to get from a node x to a node y in a network is called the distance between nodes x and y, and a path through the network that achieves this distance is called the shortest path between nodes x and y. The average of shortest path lengths over all pairs of nodes in a network is called the average network diameter. The spectrum of shortest path lengths is the distribution of shortest path lengths between all pairs of nodes in a network. The clustering coefficient of a node z in a network, C_z, is defined as the probability that two nodes x and y which are connected to the node z are themselves connected. The average of C_z over all nodes z of a network is the clustering coefficient, C, of the network; it measures the tendency of the network to form highly interconnected regions called clusters. The distribution of the average clustering coefficients of all nodes of degree k in a network is the clustering spectrum, C(k).
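The definitions above translate directly into code. The following is a minimal Python sketch, not taken from any of the cited tools, that computes P(k), C_z, C(k), and the average network diameter for a small undirected toy graph. All function names are hypothetical, and returning 0 for the clustering coefficient of a node with fewer than two neighbours is an assumed convention that the text does not specify.

```python
from collections import Counter, deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """P(k): fraction of nodes whose degree equals k."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / len(adj) for k, c in counts.items()}


def clustering_coefficient(adj, z):
    """C_z: fraction of neighbour pairs of z that are themselves connected."""
    nbrs = adj[z]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbours
    linked = sum(1 for x, y in combinations(nbrs, 2) if y in adj[x])
    return linked / (k * (k - 1) / 2)


def clustering_spectrum(adj):
    """C(k): average C_z over all nodes z of degree k."""
    by_degree = {}
    for z, nbrs in adj.items():
        by_degree.setdefault(len(nbrs), []).append(clustering_coefficient(adj, z))
    return {k: sum(vals) / len(vals) for k, vals in by_degree.items()}


def distances_from(adj, source):
    """Breadth-first search: shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_diameter(adj):
    """Average shortest path length over all connected pairs of distinct nodes."""
    total = pairs = 0
    for s in adj:
        for t, d in distances_from(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
P = degree_distribution(adj)
C = sum(clustering_coefficient(adj, z) for z in adj) / len(adj)
Ck = clustering_spectrum(adj)
avg_diam = average_diameter(adj)
```

In the toy graph, node c has three neighbours of which only one pair (a, b) is connected, so C_c = 1/3, while the pendant node d contributes 0 under the assumed convention.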

Another important concept is centrality, which quantifies the topological importance of a node (or edge) in a network. Several centrality measures have been proposed: degree centrality, according to which nodes with a large number of edges have high centrality; closeness centrality, according to which nodes with short paths to all other nodes have high centrality; and betweenness centrality, according to which nodes (or edges) that occur in many of the shortest paths have high centrality.

2.2 Local network properties

Global network properties may not be detailed enough to capture complex topological characteristics of real-world networks. Network properties should encompass a large number of constraints in order to reduce the degrees of freedom in which the networks being compared can vary. Thus, we next present more constraining measures of network structure, i.e., local network properties. Local properties include network motifs, which are small over-represented subgraphs (8; 51; 52), and two highly constraining graphlet-based (53) measures of local structural similarities between two networks: RGF-distance (53) and GDD-agreement (54).

2.2.1 Network motifs

Motifs are small subgraphs that are overrepresented in a network when compared to a null model (8; 51; 52). It has been argued that motifs may provide insight into both the structure and function of regions of the whole network, and even help to develop models for the evolution of biological networks (8; 51). Some authors believe that motifs can define universal classes of networks (52). Motifs can be identified in directed as well as undirected networks. There are many more directed subgraphs than undirected ones: for example, there are 13 unique directed 3-node motifs (presented in Figure 2.1), while there are just two undirected ones (a three-node path and a triangle). All directed 3- and 4-node motifs have been identified in a variety of biochemical, ecological, neurobiological, and engineering networks (8). To identify subgraphs that were likely to be important, real-world networks were compared to model networks drawn from the random network model that preserves the degree distribution of the data. Only patterns that appeared in the real-world networks at significantly higher frequencies than in randomized networks were identified as motifs. However, patterns that were functionally but not statistically significant could have been missed by this approach. The authors observed that networks of similar types shared the same motifs, whereas networks of different types did not. Thus, they implied that motifs indeed reflected the underlying processes that had generated each network type and that they could define broad classes of networks.

Figure 2.1: All directed 3-node subgraphs. The figure is taken from (8).

This was later confirmed by the same research group (52). The authors compared the local structure of networks from different fields based on their n-dimensional significance profiles, where the significance profile of a network summarized the statistical significance of the occurrence of each of n subgraphs in the network with respect to its occurrences in the corresponding model networks. For the directed networks analyzed in this study, all 13 directed 3-node subgraphs were taken into account when defining the significance profile of a network. For undirected networks, all six 4-node subgraphs were considered, due to the limited number of undirected 3-node subgraphs. Based on this analysis, several superfamilies of networks with similar significance profiles emerged (52): not only did networks of the same type have similar significance profiles, but so did networks within a superfamily that described different systems of very different sizes. One explanation for observing superfamilies could be that distinct evolutionary processes led to similar local structures, which would mean that the similarity of the significance profiles of networks within a superfamily was purely accidental. However, the authors argued that it was also possible that the different systems in a superfamily indeed had similar circuit elements since they had evolved to perform similar tasks.

Thus, abundance of a given motif when compared to a reasonable null model may be an interesting signal. However, one should be careful when relating such findings to functional biological aspects: which null model to use is still a controversial topic. All of the above-mentioned studies used the simple random graph model that preserved the degree distribution of the data and did not incorporate any evolutionary information. However, networks of different types may require different null models. As alternatives to the random network model used in the studies by Milo et al. (8; 52), networks originating from two additional network models were used for identification of network motifs (72). These models were a random lattice (geometrical) model, according to which nodes close in space tended to interact with higher probabilities than two distant nodes, and a preferential attachment scale-free model (72). The authors showed that the same network motifs were identified when the real-world networks were compared to model networks drawn from these two models as when the random graph model with the same degree distribution as the data was used. Thus, they argued that the observed network motifs could have arisen not by network evolution, but by other mechanisms such as spatial proximity or preferential attachment. Therefore, they concluded that motifs do not necessarily carry functional information. In response, Milo et al. (73) provided a counterexample showing that geometric properties (i.e., spatial proximity) and preferential attachment do not seem to explain the structure of real-world networks. They did so by comparing the significance profiles of real-world networks with those of geometric and preferential attachment model networks, using the random network model with the same degree distribution as the reference model. However, the previous study (72) showed that one should be very careful when comparing significance profiles of networks of different types based on a common, but inappropriate null model, since this could give the wrong impression that different networks are in fact similar with respect to their motif significance profile.

2.2.2 Graphlet-based similarity measures of local network structure

RGF-distance and GDD-agreement are based on graphlets, which are small connected non-isomorphic induced subgraphs of large networks (53). Note that graphlets are different from network motifs since they must be induced subgraphs (motifs are partial subgraphs) and since they do not need to be over-represented in the data when compared with randomized networks. An induced subgraph of a graph G on a subset S of nodes of G is obtained by taking S and all edges of G having both end-points in S; partial subgraphs are obtained by taking S and some of the edges of G having both end-points in S. Since the number of graphlets on n nodes increases super-exponentially with n, RGF-distance and GDD-agreement computations are currently based on 3-5-node graphlets (presented in Figure 2.2).

Figure 2.2: All 3-node, 4-node and 5-node graphlets (53).
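To make the distinction between induced and partial subgraphs concrete, here is a small Python sketch under stated assumptions: it enumerates all 3-node subsets by brute force, which is only practical for toy graphs and is not the counting procedure of (53), and the function names are hypothetical. It counts the two 3-node graphlets (path and triangle) as induced subgraphs and, for contrast, counts 3-node paths as partial subgraphs.

```python
from itertools import combinations
from math import comb


def induced_subgraph(edges, subset):
    """Edges of the subgraph induced on `subset`: every edge with both end-points in it."""
    subset = set(subset)
    return {frozenset(e) for e in edges if set(e) <= subset}


def count_3node_graphlets(edges):
    """Count induced 3-node paths and triangles by checking every 3-node subset."""
    nodes = sorted({u for e in edges for u in e})
    counts = {"path": 0, "triangle": 0}
    for trio in combinations(nodes, 3):
        m = len(induced_subgraph(edges, trio))
        if m == 2:      # two edges on three nodes: connected, a 3-node path
            counts["path"] += 1
        elif m == 3:    # all three edges present: a triangle
            counts["triangle"] += 1
    return counts       # subsets with 0 or 1 edge are disconnected, so not graphlets


def count_partial_paths(edges):
    """Count 3-node paths as partial subgraphs: any two edges sharing a middle node."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(comb(d, 2) for d in degree.values())


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
graphlets = count_3node_graphlets(edges)
partial_paths = count_partial_paths(edges)
```

On the toy graph the induced counts are two paths and one triangle, whereas five paths appear as partial subgraphs, because the triangle contains three paths once one of its edges is ignored.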

RGF-distance compares the frequencies of the appearance of all 3-5-node graphlets in two networks (53). The individual frequency of the appearance of each of these graphlets in a network is first normalized by dividing it by the total number of all graphlets in the network, thus accounting for the current incompleteness of the data. Then, the logarithm function is applied to the normalized (i.e., relative) frequency of a graphlet, because frequencies of different graphlets can differ by several orders of magnitude and RGF-distance should not be entirely dominated by the most frequent graphlets. Next, the relative distance between two networks for a given graphlet is found by subtracting the logarithmized relative frequencies of the graphlet in the two networks. Finally, the total RGF-distance is found as a sum of individual distances for all 29 3-5-node graphlets. If the networks being compared have the same number of nodes and edges, the frequencies of occurrence of the only 1-node graphlet (a node) and the only 2-node graphlet (an edge) are also taken into account by this measure. Thus, RGF-distance encompasses 31 similarity constraints by examining the fit of 31 graphlet frequencies.

GDD-agreement generalizes the notion of the degree distribution to the spectrum of graphlet degree distributions (GDDs) in the following way (54). The degree distribution measures the number of nodes of degree k, i.e., the number of nodes touching k edges, for each value of k. Note that an edge is the only graphlet with two nodes (the graphlet denoted by G_0 in Figure 2.3). GDDs generalize the degree distribution to other graphlets: they measure, for each graphlet G_i, i ∈ {0, 1, ..., 29} (illustrated in Figure 2.3), the number of nodes touching k graphlets G_i at a particular node. The node at which a graphlet is touched is topologically relevant, since it allows us to distinguish between nodes touching, for example, a copy of graphlet G_1 in Figure 2.3 at an end-node, or at the middle node.
This is summarized by automorphism orbits (or just orbits, for brevity), as illustrated in Figure 2.3: for graphlets G_0, G_1, ..., G_29, there are 73 different orbits, numbered from 0 to 72 (see (54) for details). For each orbit j, the jth GDD, i.e., the distribution of the number of nodes touching the corresponding graphlet at orbit j, is measured. Thus, the degree distribution is the 0th GDD. The jth GDD-agreement compares the jth GDDs of two networks (see (54) for details). The total GDD-agreement between two networks is the arithmetic or the geometric average of the jth GDD-agreements over all j (henceforth arithmetic and geometric averages are denoted by amean and gmean, respectively). GDD-agreement is scaled to always be between 0 and 1, where 1 means that two networks are identical with respect to this property. By calculating the fit of each of the 73 GDDs of the networks being compared, GDD-agreement encompasses 73 similarity constraints. Furthermore, each of these 73 constraints enforces a similarity of two distributions, additionally restricting the ways in which the networks being compared can differ. (Note that the degree distribution is only one of these 73 constraints.) Therefore, GDD-agreement is a very strong measure of structural similarity between two networks.

Both RGF-distance and GDD-agreement were used to discover a new, well-fitting, geometric random graph model of PPI networks (53; 54). In Section 3.2, we illustrate the biological importance of choosing a well-fitting network null model of PPI networks.

Figure 2.3: The thirty 2-, 3-, 4-, and 5-node graphlets G_0, G_1, ..., G_29 and their automorphism orbits 0, 1, 2, ..., 72. In a graphlet G_i, i ∈ {0, 1, ..., 29}, nodes belonging to the same orbit are of the same shade (54).
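The RGF-distance computation described earlier in this section (normalize, take logarithms, subtract, sum) can be sketched in a few lines of Python. This is a hedged illustration rather than the implementation of (53). It assumes that graphlet counts are already available as two equal-length lists, that the individual per-graphlet distance is the absolute value of the difference, and that base-10 logarithms are used. It also refuses zero counts, because the text does not say how absent graphlets are treated. The function names are hypothetical.

```python
import math


def log_relative_frequencies(counts):
    """Normalize graphlet counts by their total, then take base-10 logarithms."""
    total = sum(counts)
    if any(c <= 0 for c in counts):
        # The treatment of graphlets that never occur is not specified in the text,
        # so this sketch simply rejects zero (or negative) counts.
        raise ValueError("all graphlet counts must be positive in this sketch")
    return [math.log10(c / total) for c in counts]


def rgf_distance(counts_g, counts_h):
    """Sum over graphlets of |log relative frequency in G - log relative frequency in H|."""
    if len(counts_g) != len(counts_h):
        raise ValueError("both networks need counts for the same set of graphlets")
    f_g = log_relative_frequencies(counts_g)
    f_h = log_relative_frequencies(counts_h)
    return sum(abs(a - b) for a, b in zip(f_g, f_h))


# Two hypothetical networks described by counts of only two graphlet types.
d = rgf_distance([10, 90], [50, 50])
```

Because the counts are normalized first, multiplying all counts of one network by a constant leaves the distance unchanged, which reflects the stated aim of accounting for data incompleteness.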

Although local network properties are more constraining than global ones, they require searching for the occurrences of small subgraphs in large networks. This process is computationally intensive, and exhaustive searches become computationally infeasible even when applied to currently incomplete PPI networks. Thus, heuristic algorithms for subgraph search will unquestionably be needed as PPI data becomes more complete. For this reason, two heuristic approaches for estimating graphlet frequency distributions in networks have already been proposed (74): Targeted Node Processing (TNP) and Neighborhood Local Search (NLS). The TNP heuristic identifies a small part of the network in which graphlets can be quickly found exhaustively, typically a sparse periphery of the network, thus separating nodes that are easy to process from those that are hard to process. Then, it uses the obtained graphlet frequency distribution to estimate the graphlet frequency distribution in the entire network. NLS randomly chooses a seed node in a network and searches in its neighborhood for a specific graphlet. While the TNP approach processes only the periphery of the network, NLS randomly samples the network, and each part of the network has the same probability of being sampled. TNP and NLS achieve accurate graphlet frequency distribution estimates and run faster than exhaustive searches; the speed-up factors for each method are given in (74).

Chapter 3 Network models

3.1 Survey of network models

Many theoretical network models have been proposed for protein-protein interaction networks (see Figure 3.1). Erdős-Rényi random graphs ("ER") are based on the principle that every pair of nodes is equally likely to be connected by an edge. Erdős and Rényi have defined several variants of the model. The most commonly studied one is denoted by G_{n,p}, where each possible edge in the graph on n nodes is present with probability p and absent with probability 1 - p. Despite the simplicity of the model and the very few parameters (n, p), these networks are capable of showing an impressive number of non-trivial behaviors. They are, of course, quite uniform or democratic: every node has the same average neighborhood. This statistical homogeneity is essentially the reason why these networks have small diameters, Poisson degree distributions, and low clustering coefficients, and thus do not provide a good fit to real-world PPI networks, which typically have small diameters, but power-law degree distributions and high clustering coefficients.

Random graphs with the same degree distribution as the data ("ER-DD") capture the degree distribution of a real-world network while leaving all other aspects as in the Erdős-Rényi random model. They can be generated by using the stubs method (see Section IV.B.1 of (50) for details): the number of stubs (to be filled by edges) is assigned to each node in the model network according to the degree distribution of the real-world network; edges are created between pairs of nodes picked at random; after an edge is created, the number of stubs left available at the corresponding end-nodes of the edge is decreased by one. Thus, these networks preserve the degree distribution and small diameters of the real-world networks. However, this model also produces networks with low clustering coefficients and thus other network models have been sought.

One such example is the small-world network model (61). These networks are created from regular ring lattices by random rewiring of a small percentage of their edges. However, although these networks have high clustering coefficients and small diameters, they fail to reproduce the power-law degree distributions of real-world networks.

Scale-free networks are characterized by power-law degree distributions. One such model is the Barabási-Albert preferential attachment model (60) ("SF-BA"), in which newly added nodes preferentially attach to existing nodes with probability proportional to the degree of the target node. It has been shown that the starting configuration strongly influences the properties of the resulting networks (60). Other variants focused on modeling PPI networks include scale-free network models constructed by mimicking gene duplications and mutations (75; 76). These duplication and divergence models, in which individual nodes are occasionally copied and subsequently mutated with a certain probability, are more biologically motivated and can produce power-law distributions as well. In scale-free model networks, the connectivity of some nodes is significantly higher than that of the other nodes, resulting in a power-law degree distribution.
Thus, these networks are more robust towards random node removal than ER networks, but are more sensitive to targeted attacks on the high-degree nodes (77). If PPI networks are scale-free, this could suggest that highly connected nodes in these networks are more important than low-degree nodes. Although the degree distribution of scale-free model networks follows a power law and the average diameter is small, they typically still have low clustering coefficients.
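The generative rules of three of the models surveyed so far (ER, ER-DD via the stubs method, and SF-BA) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code of GraphCrunch or of the cited papers: the function names are hypothetical, the stubs method here rejects self-loops and duplicate edges by reshuffling (one of several possible conventions), and the preferential attachment generator starts from a small clique, which is an assumed starting configuration.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """G_{n,p}: include each of the n(n-1)/2 possible edges independently with probability p."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}


def stubs_method(degrees, rng, max_tries=1000):
    """ER-DD sketch: give node i degrees[i] stubs and pair stubs uniformly at random.

    Assumption: pairings containing self-loops or duplicate edges are rejected
    and the stubs are reshuffled, so the result is a simple graph.
    """
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2:
        raise ValueError("the sum of degrees must be even")
    for _ in range(max_tries):
        rng.shuffle(stubs)
        pairs = [tuple(sorted(stubs[i:i + 2])) for i in range(0, len(stubs), 2)]
        if all(u != v for u, v in pairs) and len(set(pairs)) == len(pairs):
            return set(pairs)
    raise RuntimeError("no simple graph found within max_tries")


def barabasi_albert(n, m, rng):
    """SF-BA sketch: each new node attaches to m distinct existing nodes,
    chosen with probability proportional to their current degree.

    Assumption: growth starts from a clique on m + 1 nodes (requires n > m >= 1).
    """
    edges = set(combinations(range(m + 1), 2))
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [u for e in edges for u in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.add((t, new))
            endpoints.extend((t, new))
    return edges


rng = random.Random(0)
er = erdos_renyi(20, 0.2, rng)
er_dd = stubs_method([2, 2, 2, 1, 1], rng)
sf_ba = barabasi_albert(50, 2, rng)
```

Since the text notes that the starting configuration strongly influences the resulting SF-BA networks, the choice of the initial clique should be treated as a tunable assumption rather than part of the model definition.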

High clustering coefficients of real-world networks are well reproduced by geometric random graphs ("GEO"), which are defined as follows: nodes correspond to uniformly randomly distributed points in a metric space, and edges are created between pairs of nodes if the corresponding points are close enough in the metric space according to some distance norm (62). For example, 3-dimensional Euclidean boxes and the Euclidean distance norm ("GEO-3D") have been used to model PPI networks (53; 54). Although this model creates networks with high clustering coefficients and small diameters, it still fails to reproduce the power-law degree distributions of real-world PPI networks. Instead, geometric random graphs have Poisson degree distributions. However, it has been argued that power-law degree distributions in PPI networks are an artifact of noise present in them (78; 79).

Finally, the stickiness network model ("STICKY") is based on stickiness indices, numbers that summarize node connectivities and thus also the complexities of binding domains of proteins in PPI networks. The probability that there is an edge between two nodes in this network model is directly proportional to the stickiness indices of the nodes, i.e., to the degrees of their corresponding proteins in real-world PPI networks (see (55) for details). Networks produced by this model have the expected degree distribution of a real-world network. Additionally, they mimic well the clustering coefficients and the diameters of real-world networks.

3.2 An optimized null model for PPI networks

Modeling biochemical networks is a vibrant research area. The choice of an appropriate null model can have important implications for many graph-based analyses of these networks. For example, the use of an adequate null model is vital for structural motif discovery, which requires comparing real-world networks with randomized ones (72; 73).
Using an inappropriate network null model may flag subgraphs as overrepresented (or underrepresented) that would not otherwise have been identified as such. Another example is that a good null model can be used to guide biological experiments in a time- and cost-optimal way and thus minimize the costs of interactome detection by predicting the behavior of a system (57). Since incorrect models lead to incorrect predictions, it is vital to have as accurate a model as possible.

Figure 3.1: Some of the network models that have been proposed for PPI networks.

3.2.1 Scale-freeness of PPI networks

As new biological network data becomes available, we must ensure that the theoretical models continue to accurately represent the data. The scale-free model has been assumed to provide such a model for PPI networks. Several authors have shown that the degree distributions of most PPI networks are well fit by a power law, indicating that these are scale-free networks in which most proteins have a small number of neighbors, while a small number of proteins are hubs having a large number of neighbors. Here, it is important to distinguish between two types of hubs: party hubs, whose genes are co-expressed with all their neighbors' genes over many physiological conditions, and date hubs, whose genes are co-expressed with only one or a few neighbors' genes in each physiological condition (80). The latter are thus not true hubs, since their degree is low and depends on the physiological state.

However, whether PPI networks are indeed scale-free is still not clear. There has been a heated debate about the interpretation of the power law observed in the degree distribution of most real-world data. Real-world data are noisy, inaccurate, and incomplete, and thus the data are sampled from a potentially much larger network. To assess the validity of the power-law findings, some authors demonstrated that sampling from a scale-free network resulted in a non-scale-free network (78). More importantly, it was shown that a power law could be observed in networks obtained by sampling from networks having degree distributions very distinct from power laws (79). More precisely, the authors generated graphs belonging to four network models with quite different topologies: random, exponential, power law, and truncated normal. A partial sampling from all of these different networks resulted in sub-networks with topological characteristics that were virtually indistinguishable from those of current (partial) PPI networks. Their conclusion was that, with the current limited coverage levels, the observed scale-free topology of existing PPI networks could not be confidently extrapolated to complete PPI networks.

3.2.2 Geometricity of PPI networks

Moreover, in the light of new PPI network data, several studies have started questioning the goodness of fit of the scale-free network model. For example, Pržulj et al. (53) and Pržulj (54) have used two highly constraining measures of local network structure to compare real-world PPI networks to various network models and have shown compelling evidence that the structure of yeast PPI networks is closer to the geometric random graph model than to the widely accepted scale-free model. Furthermore, Higham et al. (81) have designed a method for embedding networks into a low-dimensional Euclidean space and demonstrated that PPI networks can be embedded and thus have a geometric graph structure (see below).

In search of a well-fitting null model for biological networks, one has to consider biological properties of the system being modeled. The geometric random graph model of PPI networks is biologically motivated. Genes, and proteins as their products, exist in some high-dimensional biochemical space. The currently accepted paradigm is that evolution happens through a series of gene duplication and mutation events. Thus, after a parent gene is duplicated, the child gene is at the same position in the biochemical space as the parent and therefore inherits interactions with all interacting partners of the parent. Evolutionary optimization then acts on the child gene to either become obsolete and disappear from the genome, or to mutate, distancing itself somewhat from the parent, but preserving some of the parent's interacting partners (due to proximity to the parent in the biochemical space) while also establishing new interactions with other genes (due to the short distance of the mutated child from the parent in the space). Similarly, in geometric random graphs, the closer the nodes are in a metric space, the more interactors they will have in common, and vice versa. Thus, the superior fit of geometric random graphs to PPI networks over other random models is not surprising.

A well-fitting null model should generate graphs that closely resemble the structure of real networks. This closeness in structure is reflected across a wide range of statistical measures, i.e., network properties. Thus, testing the fit of a model entails comparing model-derived random graphs to real networks according to these measures. Global network properties, such as the degree distribution, may not be detailed enough to capture the complex topological characteristics of PPI networks, as illustrated in Figure 3.1. The figure shows, among others, the hierarchical model that was proposed only because it matched some of the global network properties of PPI networks.
However, it is obvious that PPI networks do not have such a structure. Thus, more rigorous properties of network topology have to be used for the purpose of finding a well-fitting null model. The more constraining the measures are, the fewer degrees of freedom exist in which the compared networks can vary. Thus, by using highly constraining measures, such as RGF-distance and GDD-agreement, a better-fitting null model can be found.

RGF-distance was used to compare PPI networks of yeast S. cerevisiae and fruitfly D. melanogaster to a variety of network models and to show the superiority of the fit of the geometric random graph model to these networks over three other random graph models. Pržulj et al. (53) compared the frequencies of the appearance of all 3-5-node graphlets in these PPI networks with the frequencies of their appearance in four different types of random networks of the same size as the data: ER, ER-DD, SF-BA, and GEO. Furthermore, several variants of the geometric random graphs were used, depending on the dimensionality of the Euclidean space chosen to generate them: two-dimensional (GEO-2D), three-dimensional (GEO-3D), and four-dimensional (GEO-4D) geometric random graphs. Four real-world PPI networks were analyzed: high-confidence and lower-confidence yeast PPI networks (14), and high-confidence and low-confidence fruitfly PPI networks (28). Pržulj et al. (53) computed RGF-distances between these real-world PPI networks and the corresponding ER, ER-DD, SF-BA and GEO random networks. They found that the GEO random networks fitted the data an order of magnitude better than other network models in the higher-confidence PPI networks, and less so (but still better) in the noisier PPI networks. The only exception was the noisy fruitfly PPI network, which exhibited scale-free behavior. It was hypothesized that this behavior of the graphlet frequency parameter was the consequence of a large amount of noise present in this network. Since currently available PPI data sets are incomplete, i.e., have a large percentage of false negatives or missing interactions, the complete networks are expected to have higher edge densities. Pržulj et al. (53) therefore also compared the high-confidence yeast PPI network against 3-dimensional geometric random graphs with the same number of nodes, but about three and six times as many edges as the PPI network, respectively.
By making the GEO-3D networks corresponding to this PPI network about six times as dense as the PPI network, the closest fit to the PPI network with respect to RGF-distance was observed. Additionally, to address the existence of noise, i.e., false positives, in PPI networks, the high-confidence yeast PPI network was perturbed by randomly adding, deleting and rewiring 10%, 20% and 30% of edges, and RGF-distances between the perturbed networks and the PPI network were computed. The study demonstrated that graphlet frequencies were robust to these random perturbations, thus further increasing the confidence in PPI networks having geometric network structure.

Geometric structure of PPI networks has also been confirmed by GDD-agreements between PPI and model networks drawn from several different random graph models (53): ER, ER-DD, SF-BA, and GEO-3D. Several PPI networks of each of the following four eukaryotic organisms were examined: yeast S. cerevisiae, fruitfly D. melanogaster, nematode worm C. elegans, and human. In total, fourteen PPI networks originating from different sources, obtained with different interaction detection techniques (such as Y2H, TAP, or HMS-PCI, as well as human curation), and of different interaction confidence levels were analyzed. The GEO-3D network model showed the highest GDD-agreement for all but one of the fourteen PPI networks; for the remaining network, GDD-agreements between GEO-3D, SF, and ER-DD models and the data were about the same.

Additionally, an algorithm that directly tests whether PPI networks are geometric has been proposed (81). It does so by embedding PPI networks into a low-dimensional Euclidean space. If a geometric network model fits the PPI network data, then it is expected that PPI networks can be embedded into the Euclidean space. The algorithm is based on multidimensional scaling, with path lengths playing the role of the Euclidean distances. The sensitivity and specificity of the fit are judged by computing the area under the Receiver Operating Characteristic (ROC) curve. Geometric random graphs in 2-dimensional Euclidean space are generated by placing N nodes uniformly at random in the unit square, and by connecting two nodes by an edge if they are within a given Euclidean distance. The 3- and 4-dimensional cases are defined analogously.
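The generation procedure for geometric random graphs described above can be sketched in a few lines. This is a minimal illustration rather than the generator used in the cited studies; the function name `geometric_random_graph` and its parameters (`radius`, `dim`, `seed`) are hypothetical, and the unit cube is assumed as the natural higher-dimensional analogue of the unit square.

```python
import itertools
import math
import random


def geometric_random_graph(n, radius, dim=2, seed=None):
    """Place n nodes uniformly at random in the unit cube [0, 1]^dim and
    connect two nodes if their Euclidean distance is at most `radius`."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    edges = set()
    # Check every unordered pair of nodes once (quadratic in n).
    for u, v in itertools.combinations(range(n), 2):
        if math.dist(points[u], points[v]) <= radius:
            edges.add((u, v))
    return points, edges


# Example: a GEO-3D-style graph on 50 nodes.
points, edges = geometric_random_graph(50, 0.2, dim=3, seed=1)
```

Increasing `radius` increases the edge density, which is how a model network can be made denser than the data it is compared against.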
The task is then to embed the proteins into n-dimensional Euclidean space for n = 2, 3, 4, given only their PPI network connectivity information, so that the distances between the proteins are conserved. After proteins are embedded, the distance radius is varied.

Specificity and sensitivity are then measured for different radii, and the overall goodness of fit is recorded by computing the area under the ROC curve. The algorithm was applied to nineteen real-world PPI networks of yeast, fruitfly, worm, and human obtained from different sources, as well as to artificial networks generated using seven types of random graph models: ER, ER-DD, GEO (GEO-2D, GEO-3D, and GEO-4D), SF-BA, and STICKY. These networks were embedded into 2-dimensional (2D), 3-dimensional (3D), and 4-dimensional (4D) Euclidean space. The resulting areas under the ROC curve (AUCs) were high for all PPI networks. The highest AUC value was obtained for embedding the high-confidence yeast (YHC) PPI network. The authors focused their further analyses on the YHC network. Random graphs of the same size as the YHC network drawn from the seven network models were embedded into 2D, 3D and 4D space. For geometric random networks, AUCs were very high, with values above 0.9. For non-geometric networks, AUCs were lower. Since PPI networks are noisy, to test whether PPI networks had a geometric structure, the authors added noise to GEO-3D networks by randomly rewiring 10%, 20% and 30% of their edges. These rewired networks were then embedded into 2D, 3D and 4D space, and their AUCs were computed. The values of AUCs for the 10% rewired GEO-3D networks were very similar to those for real-world networks. Thus, by embedding networks into low-dimensional Euclidean space, the method provides a direct test of whether PPI networks have a geometric graph structure. The results lend support to the results of previous studies and to the hypothesis that the structure of currently available PPI networks is consistent with the structure of (noisy) geometric graphs (81).

Stickiness of PPI networks

Another biologically motivated, stickiness-index-based network model has been proposed for PPI networks (55).
It is commonly considered that proteins interact because they share complementary physical aspects, a concept that is consistent with the underlying biochemistry. These physical aspects are referred to as binding domains (55).

The stickiness-index-based network model is based on stickiness indices of proteins in PPI networks, where the stickiness index of a protein is a single number that is based on its normalized degree and summarizes the abundance and popularity of binding domains on the protein. The model assumes that a high degree of a protein implies that the protein has many binding domains and/or that its binding domains are commonly involved in interactions. Additionally, the model considers that a pair of proteins is more likely to interact (i.e., share complementary binding domains) if both proteins have high stickiness indices, and less likely to interact if one or both have a low stickiness index. Thus, according to this model, the probability that there exists an edge between two nodes in a random graph is the product of the two stickiness indices of the corresponding proteins in the PPI network (see (55) for details). The resulting model networks are guaranteed to have the expected degree distributions of real-world networks. To examine the fit of this network model to real-world PPI networks, as well as to compare its fit against the fit of other network models, a variety of global and local network properties were used (55): the degree distribution, clustering coefficient, network diameter, and RGF-distance. In addition to the stickiness-index-based network model (STICKY), model networks were also drawn from the following network models: ER, ER-DD, SF-BA, and GEO-3D. The fit of fourteen real-world PPI networks of four organisms (yeast, fruitfly, worm, and human) to each of these five network models was evaluated with respect to all of the above-mentioned network properties. With respect to RGF-distance, the stickiness model showed an improved fit over all other network models in ten out of fourteen tested PPI networks. It performed as well as the GEO-3D model in one and was outperformed by the GEO-3D model in three PPI networks.
In addition, this model reproduced global network properties well, such as the degree distribution, the clustering coefficients, and the average diameters of PPI networks. Thus, using the biologically motivated assumptions mentioned above, this model clearly outperforms network models such as SF-BA and ER-DD that also match the degree distribution of a real-world PPI network.
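The edge rule of the stickiness model (edge probability equal to the product of two stickiness indices) can be illustrated with a short sketch. The text above only states that the index is based on the normalized degree; the specific normalization used here, degree divided by the square root of the sum of all degrees, is an assumption of this sketch (see (55) for the authors' definition), as are the function names and the capping of probabilities at 1.

```python
import itertools
import math
import random


def stickiness_indices(degrees):
    """Assumed normalization: degree / sqrt(sum of all degrees)."""
    total = sum(degrees)
    if total == 0:
        return [0.0 for _ in degrees]
    return [d / math.sqrt(total) for d in degrees]


def sticky_graph(degrees, seed=None):
    """Draw a random graph in which nodes u and v are connected with
    probability theta[u] * theta[v] (capped at 1)."""
    rng = random.Random(seed)
    theta = stickiness_indices(degrees)
    edges = set()
    for u, v in itertools.combinations(range(len(degrees)), 2):
        p = min(1.0, theta[u] * theta[v])
        if rng.random() < p:
            edges.add((u, v))
    return edges


# Degrees would normally be taken from a real PPI network.
model_edges = sticky_graph([3, 2, 2, 1, 1, 1], seed=7)
```

Under this normalization the indices sum to the square root of the total degree, so the expected degree of each node in the model is approximately its degree in the data, which is consistent with the statement that the model preserves the expected degree distribution.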

Chapter 4 Protein function prediction from PPI network topology

4.1 Motivation

The recent technological advances in experimental biology have yielded large amounts of biological network data. One such example is protein-protein interaction (PPI) networks. Recall that in these networks, nodes correspond to proteins and undirected edges represent physical interactions between them. Since a protein almost never acts in isolation, but rather interacts with other proteins in order to perform a certain function, PPI networks by definition reflect the interconnected nature of biological processes. Analyses of PPI networks may give valuable insight into biological mechanisms and provide deeper understanding of complex diseases. Defining the relationship between the PPI network topology and biological function and inferring protein function from it is one of the major challenges in the post-genomic era (82; 83; 84; 85; 86; 87; 88; 89).

Methods for protein function prediction

There exist two major types of approaches for determining protein function from the topology of PPI networks (56): direct methods and cluster-based methods.

Direct methods

The methods of the first type are referred to as direct methods (56), since they consider that proteins that lie closer to one another in the PPI network are more likely to have similar function, thus assuming a correlation between network distance and functional distance. The simplest method of this type is the majority rule, which investigates only the direct neighborhood of a protein and annotates it with up to three most common functions of its annotated neighbors (84). The method was applied to the yeast PPI network consisting of 2,709 interactions amongst 2,039 yeast proteins, obtained by merging interactions from numerous publicly available data sources (17; 18; 37; 23). 72% of the proteins in the network were functionally annotated, 39% of which were annotated with more than one function. Motivated by the fact that 65% of the interactions in the network occurred amongst proteins with at least one common functional assignment, the authors predicted functions of unannotated proteins in the network based on the functions of their annotated neighbors. However, out of the 28% of proteins in the network that were unannotated (i.e., out of 554 of them), only 364 proteins had at least one partner of known function, and only 69 had two or more partners of known function. Moreover, only 29 of these 69 proteins had two or more interacting proteins with at least one function in common (84), thus limiting the potential prediction space. The major drawbacks of this simple majority rule method are as follows. The approach does not assign any significance values to predicted functions. Additionally, it considers only nodes directly connected to the protein of interest and thus, only very limited topology of a network is used in the annotation process.
Finally, it fails to differentiate between proteins at different distances from the target protein.
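The majority rule described above is simple enough to state directly in code. The following is a minimal sketch, not the implementation used in (84); the data layout (a dictionary of neighbor lists and a dictionary of annotation sets) and the function name `majority_rule` are assumptions made for illustration.

```python
from collections import Counter


def majority_rule(protein, neighbors, annotations, top_k=3):
    """Return up to `top_k` most common functions among the annotated
    direct neighbors of `protein`."""
    counts = Counter()
    for nb in neighbors.get(protein, ()):
        # Unannotated neighbors contribute nothing.
        counts.update(annotations.get(nb, ()))
    return [function for function, _ in counts.most_common(top_k)]


# Hypothetical toy data: protein "P" has four partners, one unannotated.
neighbors = {"P": ["A", "B", "C", "D"]}
annotations = {
    "A": {"transport"},
    "B": {"transport", "repair"},
    "C": {"transport", "repair"},
}
predicted = majority_rule("P", neighbors, annotations)
```

The sketch makes the drawbacks listed above visible: the counts carry no significance value, and only the direct neighborhood is consulted.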

One approach has tried to overcome these limitations by observing the n-neighborhood of a protein and by assigning a confidence to each predicted function, where the n-neighborhood of a protein is defined as the set of proteins that are at most at distance n from the target protein (85). The confidence of a function that is to be assigned to the protein of interest is found by counting the number of proteins in its n-neighborhood having the predicted function, and by computing the expected number of proteins in its n-neighborhood having the predicted function, based on the frequency of the function among all proteins in the network. Then, each potential function is assigned the χ-square score computed based on these two values. Finally, the protein of interest is assigned the function with the highest χ-square value among functions of all n-neighboring proteins. The method was applied to predict three categories of yeast protein function: the subcellular localization, the cellular role, and the biochemical function (as defined in YPD (37)), in the PPI network consisting of 2,112 interactions (17; 18; 23). The following accuracies were achieved for the three categories: 72.7%, 63.6% and 52.7%, respectively. The authors provided predictions for 16 out of 409 unannotated proteins that had more than five binding partners, since predictions for proteins with higher degrees were achieved with higher accuracy. Although this approach covers a larger portion of the network by observing the n-neighborhood of a protein, compared to the simple majority rule that observes only the direct 1-neighborhood, it still fails to distinguish between proteins at different distances from the protein of interest. This drawback has been addressed in a later study (90), which assigns different weights to proteins at different distances from the target protein. However, this method observes only 1- and 2-neighborhoods of proteins, thus again covering only their local topologies.
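The n-neighborhood scoring described above can be sketched as follows. The text does not spell out the statistic, so the standard form (observed - expected)^2 / expected is assumed here; the data layout and the function names `n_neighborhood` and `chi_square_scores` are likewise hypothetical.

```python
from collections import Counter, deque


def n_neighborhood(adj, source, n):
    """Proteins at distance 1..n from `source`, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == n:
            continue  # do not expand beyond distance n
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v in dist if v != source}


def chi_square_scores(adj, annotations, protein, n):
    """Score each function seen in the n-neighborhood of `protein`.

    `adj` is assumed to contain every protein of the network as a key.
    """
    hood = n_neighborhood(adj, protein, n)
    freq = Counter(f for fs in annotations.values() for f in fs)
    observed = Counter(f for v in hood for f in annotations.get(v, ()))
    scores = {}
    for function, n_f in observed.items():
        # Expected count from the network-wide frequency of the function.
        e_f = len(hood) * freq[function] / len(adj)
        scores[function] = (n_f - e_f) ** 2 / e_f
    return scores


# Hypothetical toy network: a path a-b-c-d-e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
annotations = {"b": {"X"}, "c": {"X"}, "d": {"Y"}, "e": {"Y"}}
scores = chi_square_scores(adj, annotations, "a", 2)
best = max(scores, key=scores.get)
```

Note that every protein in the n-neighborhood contributes equally regardless of its distance, which is the limitation pointed out above.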

For this reason, several graph-cut, global optimization-based, proteome-scale function prediction strategies have been proposed (83; 82; 91). Here, we present the one by Vazquez et al. (83). According to their method, any given assignment of functions to the whole set of unclassified proteins in a network is given a score, counting the number of interacting pairs of nodes with no common function; the functional assignment with the lowest score maximizes the presence of the same function among interacting proteins (83). The method was applied to the yeast PPI network used by Schwikowski et al. (84) to assign up to three most probable functions to all unannotated proteins in MIPS (23), where the probable functions were those that occurred more often. To validate their method, the authors predicted functions for a number of already annotated proteins by hiding their annotations, demonstrating that a correct prediction can be made 60-70% of the time (83). An approach that reduces the computational requirements of this method has been proposed (92). A drawback of these graph-cut-based methods for protein function prediction is that they take into account global properties of the network, but again fail to distinguish between proteins at different distances from the protein of interest, thus not rewarding local proximity. For this reason, a network-flow-based method that considers both local and global effects has been proposed (82). The method works as follows. Each functionally annotated protein in the network is considered as the source of a functional flow. Then, the spread of the functional flow through the network is simulated over time, and each unannotated protein is assigned a score for having the function based on the amount of flow it received during the simulation.
Finally, several probabilistic direct methods have been introduced, all relying on a Markovian assumption that the function of a protein is independent of all other proteins given the functions of its immediate neighbors (87; 88; 91; 86). However, these probabilistic methods are out of the scope of this survey.
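The scoring idea behind the graph-cut approach described above (count the interacting pairs that share no function, and prefer the assignment with the lowest count) can be illustrated on a toy network. This is only a sketch: the exhaustive search below is a stand-in that is feasible for a handful of proteins and is not the optimization procedure of (83), and the names `assignment_score` and `best_assignment` are hypothetical.

```python
import itertools


def assignment_score(edges, functions):
    """Number of interacting pairs whose function sets do not overlap."""
    return sum(1 for u, v in edges if not (functions[u] & functions[v]))


def best_assignment(edges, known, unknown, candidates):
    """Try every single-function assignment for the unknown proteins and
    keep the one with the lowest score (brute force, toy sizes only)."""
    best, best_score = None, None
    for choice in itertools.product(candidates, repeat=len(unknown)):
        functions = dict(known)
        functions.update({p: {f} for p, f in zip(unknown, choice)})
        score = assignment_score(edges, functions)
        if best_score is None or score < best_score:
            best, best_score = dict(zip(unknown, choice)), score
    return best, best_score


# Hypothetical toy network with two unannotated proteins, "u" and "w".
edges = [("a", "u"), ("b", "u"), ("u", "w"), ("w", "c"), ("w", "d")]
known = {"a": {"X"}, "b": {"X"}, "c": {"Y"}, "d": {"Y"}}
best, score = best_assignment(edges, known, ["u", "w"], ["X", "Y"])
```

Because the score is a single global count, an assignment is judged only by how many interactions it cuts, not by how close an unannotated protein is to the annotated proteins that support it.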

Cluster-based methods

Approaches of the second type exploit the existence of regions in PPI networks that contain a large number of connections between the constituent proteins. These dense regions are a sign of a common involvement of those proteins in certain biological processes and therefore are feasible candidates for biological complexes. Cluster-based approaches partition the network into clusters which are assumed to be functional modules, i.e., groups of cellular components and their interactions that are enriched for biological functions (56), and instead of predicting functions of individual proteins, they assign a function to the entire cluster based on the functions of its annotated members. Various approaches for identifying these functionally enriched modules solely from PPI network topology have been defined (93; 94; 95; 96). How the modules are identified is what distinguishes the methods from one another. The highly connected subgraphs (HCS) algorithm (97) has been used to detect complexes in PPI networks (93), where a highly connected subgraph is defined as a subgraph with n nodes such that more than n/2 edges must be removed in order to disconnect it, thus ensuring that the diameter of the subgraph is at most two, and that it is at least half as dense as a clique of the same size. The HCS algorithm partitions the graph by finding the minimum graph cut and by repeating the process recursively until highly connected components are found. The restricted neighborhood search clustering (RNSC) algorithm has been defined to partition the set of nodes in the network into clusters by using a cost function to evaluate the partitioning (94). The algorithm starts with a random cluster assignment and proceeds by reassigning nodes, so as to maximize the scores of partitions. At the same time, the algorithm keeps a list of already explored partitions to avoid their reprocessing.
Finally, the clusters are filtered based on their size, density and functional homogeneity.

The molecular complex detection algorithm (MCODE) (95) decomposes the PPI network into subnetworks by node weighting; a weighted form of the clustering coefficient is applied to increase the weights of heavily interconnected graph regions while giving small weights to the less connected nodes. Once the weights of all nodes are computed, the algorithm traverses the weighted graph in a greedy fashion to isolate densely connected regions. The algorithm based on superparamagnetic clustering (SPC) (98) has been shown to perform well in detecting dense structures that are loosely connected to other areas of the network (96). The major problem of these network-partitioning methods is that in some cases, the number of clusters or the size of the sought clusters needs to be provided as input (96). Several iterative hierarchical-clustering-based methods that form clusters by computing the similarities between protein pairs have been proposed. Thus, the key decision with these methods is the choice of the appropriate similarity measure between protein pairs. The most intuitive network-topology-based measure is based on the pairwise distances between proteins in the network (56; 99): the smaller the distance between two proteins in the PPI network is, the more similar they are, and thus, the more likely they are to belong to the same cluster. In other words, module members are likely to have similar shortest path distance profiles (100). However, distances between many protein pairs are identical, leading to the "ties in proximity" problem (99). While still using the shortest path length between proteins as a distance measure, Arnau et al. (99) attempted to overcome this problem by resolving ties uniformly at random and thus obtaining multiple, equally valid hierarchical clustering solutions.
Then, the fraction of the solutions in which a protein pair was clustered together was used as a similarity measure between the two proteins, and the final round of clustering was performed based on these newly computed similarities between protein pairs, using standard hierarchical algorithms.

Moreover, the Czekanowski-Dice distance, which assigns the maximum distance value to two proteins having no common interactors and zero to those interacting with exactly the same set of proteins, was used to form clusters of proteins sharing a high percentage of interactions (89). The major drawback of hierarchical clustering methods is that only global network properties are used. However, a sensitive graph-theoretic method for comparing local network structures of protein neighborhoods in PPI networks, demonstrating that the biological function of a protein and its local network structure are closely related, has recently been proposed (65). The method summarizes the local network topology around a protein in a PPI network into a vector of graphlet degrees called the signature of a node (i.e., the signature of a protein), counting how many times the node touches each of the 73 orbits in 2-5-node graphlets shown in Figure 2.3. Then, the signature similarities between all protein pairs are computed, where a similarity of 1 means that the signatures of the two nodes are identical (see (65) for details). To illustrate signature similarities, Figure 4.1 presents the signature vectors of yeast proteins in a PPI network with signature similarities above 0.90 (Figure 4.1 A) and below 0.40 (Figure 4.1 B); signature vectors of proteins with high signature similarities follow the same pattern (Figure 4.1 A), while those of proteins with low signature similarities have very different patterns (Figure 4.1 B). Proteins with topologically similar network neighborhoods are then grouped together under this measure according to the clustering method defined by Milenković and Pržulj: for a node of interest, a cluster is constructed containing that node and all nodes in a network that are similar to it; this is repeated for each node in the PPI network, thus allowing for overlapping clusters.
The resulting protein groups have been shown to belong to the same protein complexes, perform the same biological functions, be localized in the same subcellular compartments, and have the same tissue expressions; this has been verified for numerous PPI networks of a unicellular and a multicellular eukaryotic organism, yeast and human, respectively.
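The cluster construction described above (one cluster per node, containing every node sufficiently similar to it) can be sketched as follows. The signature similarity itself is defined in (65) and is not reproduced here; the sketch assumes it is supplied as a function, and the lookup table, the protein names, and the function name `signature_clusters` are hypothetical.

```python
def signature_clusters(nodes, similarity, threshold):
    """For each node, build the cluster of that node plus every other node
    whose signature similarity to it is at least `threshold`."""
    clusters = {}
    for u in nodes:
        similar = {v for v in nodes if v != u and similarity(u, v) >= threshold}
        clusters[u] = {u} | similar
    return clusters


# Hypothetical pairwise signature similarities for three proteins.
table = {
    frozenset(("p1", "p2")): 0.97,
    frozenset(("p1", "p3")): 0.50,
    frozenset(("p2", "p3")): 0.96,
}


def lookup(u, v):
    return table[frozenset((u, v))]


clusters = signature_clusters(["p1", "p2", "p3"], lookup, threshold=0.95)
```

Here p2 appears in all three clusters, which illustrates why the resulting clusters may overlap and why neither their number nor their size has to be fixed in advance.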

Figure 4.1: Signature vectors of proteins with signature similarities: (A) above 0.90; and (B) below 0.40. The 73 orbits are presented on the abscissa and the numbers of times that nodes touch a particular orbit are presented on the ordinate in log scale. In the interest of the aesthetics of the plot, 1 is added to all orbit frequencies to prevent the log function from going to infinity in the case of orbit frequencies of 0.

The graphlet-degree-signature-based method has several advantages over simple majority rule neighborhood approaches. Not only does it assign a confidence score to each predicted annotation (in terms of hit- and miss-rates (see (65))), but in doing so it also takes into account up to 5-neighborhoods of a node along with their interconnectivities, since it is based on 2-5-node graphlets. Additionally, although the signature of a node describes its 5-deep local neighborhood, due to the typically small diameters of PPI networks, it is possible that 2-5-node-graphlet-based signatures capture the full, or almost full, topology of these networks. Thus, it overcomes all drawbacks of the direct methods. Moreover, the method does not require the number of clusters or their size to be predefined, unlike some of the other above-mentioned cluster-based approaches. Furthermore, to create pairwise similarities between protein pairs, this method uses a highly constraining local-topology-based measure of similarity of proteins' signatures, unlike other studies that use only global network properties. Thus, this method seems to overcome the disadvantages of cluster-based methods as well.

It is difficult to perform direct comparisons of the performance of all methods described above. Attempts to perform a comparison of several cluster-based methods have been made (101). However, due to the different performance measures across different studies, inconsistent definitions of functional modules (amongst cluster-based methods), fundamental differences between different annotation types, and the lack of gold standards for functional annotation, any comprehensive comparison is very difficult (56). Additionally, some studies have used the MIPS (23) annotation catalogs, whereas other studies have used Gene Ontology (102) as the annotation source, and some annotations that exist in one data source might not exist in the other. Despite these difficulties, some studies have tried to perform comparisons of different methods. Chua et al. (90) compared several direct methods and found that most of them exhibited similar performance, with the exception of the Markov random field (MRF) model (87), which outperformed other methods by a significant margin (90), possibly because it uses a more sophisticated probabilistic model. Similarly, Brohee and van Helden (101) compared several module-assisted clustering algorithms and found that MCODE (95) and SPC (98) were inferior to other methods, such as RNSC (94). However, although the recent study by Milenković and Pržulj (65) was not included in these comparisons, it seems to overcome most disadvantages of other protein function prediction approaches, both direct and cluster-based (see the discussion above). Additionally, it is the only study that related the PPI network structure to all of the following: protein complexes, biological functions, and subcellular localizations for yeast, and cellular components, tissue expressions, and biological processes for human.
Moreover, the authors not only verified their predictions in the literature, but also observed an overlap of the predicted protein functions obtained from multiple PPI networks for the same organism. Furthermore, there was overlap between their protein function predictions and those of others. Finally, starting with the topology of PPI networks of different organisms that were of different sizes and originated from a wide spectrum of small-scale and high-throughput PPI detection techniques, their method identified clusters of nodes sharing common biological properties. Thus, the authors demonstrated that their method could provide valuable guidelines for future experimental research.

Chapter 5 Biological networks in disease

In addition to protein function prediction, a focus of bioinformatics is on understanding the networks underlying human disease, such as cancer. The first group of disease-related studies has focused on describing the topological properties of disease (cancer) genes in PPI networks and on identifying novel disease (cancer) gene candidates based on their topological properties. The second group of studies has tried to better understand the relationships between diseases and the genes responsible for them. Finally, the last group of studies has used graph-theoretical approaches to analyze the networks of drugs and drug targets; additionally, these studies have tried to relate drug target networks to PPI networks and to protein functions.

5.1 Disease gene identification

Inspired by the findings that essential yeast proteins tend to have high degrees in PPI networks, several studies have attempted to perform similar analyses on disease-related genes (103). Numerous studies have investigated associations between diseases and PPI network topology.

Jonsson and Bates (67) analyzed network properties of 346 genes that had been implicated in a comprehensive census of all human cancer genes (104). They demonstrated greater connectivity and centrality of cancer genes compared to non-cancer genes, indicating an increased central role of cancer genes within the interactome. More specifically, they showed that these proteins tended to have, on average, twice as many interaction partners as non-cancer proteins. Additionally, after clustering the PPI network into overlapping clusters, the cancer proteins tended to reside in larger clusters and to participate in more clusters than non-cancer proteins. Given these topological distinctions between disease and non-disease genes, some studies have tried to identify candidate disease genes from a human PPI network. For this purpose, network properties other than the degrees of proteins can be used. For example, Radivojac et al. (66) did this by encoding each gene in the network based on the distribution of shortest path lengths to all genes associated with disease or having known functional annotation. Similarly to Jonsson and Bates (67), Goh et al. (68) initially observed that disease genes displayed a tendency to encode hubs in the interactome; they were found to have a 32% larger number of interactions with other proteins than the non-disease proteins. In other words, they discovered that high-degree proteins in the analyzed PPI network were more likely to be encoded by genes associated with diseases than low-degree proteins. However, they next showed that, despite this apparent correlation, the relationship between disease genes and their degrees needed more careful consideration. Starting from the fact that essential genes had higher degrees, the authors separated human disease genes into two groups: essential disease genes and non-essential disease genes.
Then, they showed that these two classes of genes played quite different roles in the human interactome, by analyzing whether the observed correlation between disease genes and hubs could have been the sole consequence of the fact that 22% of disease genes were at the same time essential. The authors measured the degree dependence of the non-essential disease proteins and surprisingly concluded that the correlation between hubs and disease proteins entirely disappeared compared to when all disease genes (including both essential and non-essential genes) were analyzed. Thus, the vast majority (78%) of disease genes, those that were non-essential, did not show a tendency to encode hubs, indicating (in contrast to the previously introduced study) that the observed correlation between hubs and disease genes was entirely due to the existence of essential genes within the disease gene class. Thus, some studies suggest higher degrees for disease genes, while others do not. Further work will be needed to resolve this discrepancy and carefully examine the different studies for possible sources of bias. A possible explanation is that the former study focused on cancer genes in particular, while the latter examined disease genes in general. A potential source of bias, especially in literature-curated networks, is that disease-causing proteins may have higher degrees simply because they are better studied.

All of the studies presented above have mainly used global network properties for analyzing disease genes. However, these might not be detailed enough to encompass complex topological characteristics of disease genes in the context of PPI networks. Thus, Milenković and Pržulj (65) have devised a highly constraining method based on graphlet degree signatures of proteins in PPI networks that has been applied to disease gene identification in addition to protein function prediction (see Chapter 4). A set of genes implicated in genetic diseases available from HPRD (41) was examined. To increase coverage of PPIs, the human PPI network that was analyzed was the union of the human PPI networks from HPRD (41), BIOGRID (47), and Rual et al. (31), consisting of 41,755 unique interactions amongst 10,488 different proteins.
There were 1,491 disease genes in this PPI network, out of which 71 were cancer genes. If network topology is related to disease and function, then it is expected that genes implicated in cancer might have similar graphlet degree signatures. To test this hypothesis, Milenković and Pržulj looked for all proteins with a signature similarity of 0.95 or higher with protein TP53. The resulting cluster contained ten proteins, eight of which were disease genes; six of these eight disease genes were cancer genes (TP53, EP300, SRC, BRCA1, EGFR, and AR). The remaining two proteins in the cluster were SMAD2 and SMAD3, which are members of the TGF-beta signaling pathway, whose deregulation contributes to the pathogenesis of many diseases including cancer (105). The striking signature similarity of this 10-node cluster is depicted in Figure 5.1. Thus, potential disease genes can be predicted by creating clusters of proteins with similar signatures and observing the enrichment of disease genes in each cluster. A more complete analysis of how topological clustering relates to diseases will be published in a forthcoming paper.

Figure 5.1: Signature vectors of proteins belonging to the TP53 cluster. The cluster is formed using the threshold of 0.95. The axes have the same meaning as in Figure 4.1.

Disease networks

One of the major challenges towards better understanding of disease in a cell is understanding relationships between diseases and the genes causing them. However, even with the simplest Mendelian diseases, which are inherited and controlled by a single gene, the correlation between the gene mutations and the patient's symptoms might not be clear (106). This could be due to the ability of some genes to produce multiple phenotypes, environmental factors, or the influence of other genes (one gene could mask the phenotypic effect of the other, or a gene could modify another gene). Thus, even simple Mendelian diseases can lead to complex genotype-phenotype associations (106). Moreover, the problem of gene-disease association is further complicated by the fact that not only could mutations in multiple genes cause one disease, but multiple disorders could also be caused by mutations in the same gene. Thus, the increasing knowledge about protein networks can be used towards identifying new genes and genetic mechanisms behind diseases. Whereas the earliest studies used graph-theoretical tools to gain a better understanding of the relationship between the genes implicated in a selected disorder, thus focusing on a single disease, the recent study by Goh et al. (68) comprehensively explored the relationships between human genetic disorders and the corresponding disease genes, as well as between disease genes themselves, from a higher level of cellular organization. The authors combined the human disease phenome, a systematic linkage of all genetic disorders, with the disease genome, representing the complete list of disease genes, resulting in a global view of the diseasome, the combined set of all known associations between disorders and disease genes.
The diseasome was constructed as a bipartite graph consisting of two disjoint sets of nodes: genetic disorders on one side, and disease genes on the other, where a disorder and a gene were linked if mutations in the gene were implicated in the disorder (see Figure 5.2 for illustration). The list of disorders, disease genes, and associations

between them was obtained from the Online Mendelian Inheritance in Man (107), the most complete and up-to-date repository of all known disease genes and the disorders they cause (68). Based on the diseasome bipartite graph, two network projections were constructed: the human disease network (HDN), in which nodes represented disorders and two disorders were connected if they were both caused by at least one common gene, and the disease gene network (DGN), in which nodes represented disease genes and two genes were connected if they were associated with at least one common disorder (see Figure 5.2). The authors then examined the potential of these networks to improve the understanding of known disease gene and phenotype associations.

The resulting HDN was far from being disconnected: it had many connections between disorders, suggesting that the genetic origins of most diseases were, to some extent, shared with other diseases. Most disorders were linked to only a few other disorders, whereas a few diseases (mostly cancer related) represented hubs connected to a large number of distinct disorders. In the DGN, which provided a complementary, gene-centered view of the diseasome, the number of genes involved in multiple diseases was found to decrease rapidly. However, several disease genes were involved in up to 10 disorders, representing major hubs in the network. Random shuffling of the associations between the two partitions of the diseasome (before the network projections were made) showed that the topologies of both the HDN and the DGN deviated significantly from random ones and indicated important pathophysiological clustering of disorders and disease genes (68). By overlaying the DGN on a network of human physical protein-protein interactions (31; 30), the authors found that 290 interactions overlapped between the two networks, with very high statistical significance.
Furthermore, the authors found that genes that contributed to a common disorder showed an increased tendency for their products to interact with each other through protein-protein interactions, be expressed together in specific tissues, display high coexpression levels, exhibit synchronized expression as a group, and share GO terms (68).
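The two projections of the diseasome described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in (68): the function name `project_diseasome` and the toy disorder and gene labels are hypothetical, and the input is assumed to be a plain list of (disorder, gene) association pairs.

```python
from collections import defaultdict
from itertools import combinations


def project_diseasome(associations):
    """Build HDN and DGN edge sets from (disorder, gene) pairs.

    Hypothetical helper for illustration only.
    """
    genes_of = defaultdict(set)      # disorder -> genes implicated in it
    disorders_of = defaultdict(set)  # gene -> disorders it is implicated in
    for disorder, gene in associations:
        genes_of[disorder].add(gene)
        disorders_of[gene].add(disorder)

    # HDN: two disorders are linked if they share at least one gene.
    hdn = set()
    for disorders in disorders_of.values():
        hdn.update(combinations(sorted(disorders), 2))

    # DGN: two genes are linked if they share at least one disorder.
    dgn = set()
    for genes in genes_of.values():
        dgn.update(combinations(sorted(genes), 2))
    return hdn, dgn


# Made-up associations: disorders D1..D3, genes g1..g3.
toy_associations = [("D1", "g1"), ("D1", "g2"), ("D2", "g2"), ("D3", "g3")]
hdn, dgn = project_diseasome(toy_associations)
print("HDN edges:", sorted(hdn))
print("DGN edges:", sorted(dgn))
```

In this toy input, D1 and D2 share gene g2 and are therefore linked in the HDN, while g1 and g2 are both implicated in D1 and are therefore linked in the DGN. D3 and g3 remain isolated in both projections.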

The authors also examined the topological properties of disease genes. However, unlike previous studies, they found that, as explained in more detail in the previous section, the vast majority of disease genes were nonessential and did not show a tendency to encode hubs. Only essential disease genes were found to be responsible for the observed correlations between hubs and disease genes (68). Thus, the authors reached the somewhat unexpected conclusion that nonessential disease genes were not associated with hubs, showed smaller correlation in their expression pattern with the rest of the genes in the cell than expected at random, and had a tendency to be expressed in only a few tissues. Therefore, contrary to earlier hypotheses, the vast majority of nonessential disease genes occupied functionally peripheral and topologically neutral positions in the cellular network. In contrast, essential genes were likely to encode hubs, showed highly synchronized expression with the rest of the genes, and were expressed in most tissues. Thus, only essential (disease) genes were found to be topologically and functionally central.

5.3 Characterizing drug-drug target relationships

An assessment of the number of drug targets, i.e., molecular targets that represent an opportunity for therapeutic intervention, as well as their identification, is crucial to the development of post-genomic research strategies within the pharmaceutical industry (108). Now that the size of the human genome is known, it is interesting to consider just how many molecular targets this opportunity represents. Additionally, identifying and characterizing the relationships between drugs and their protein targets, as well as between drug targets and disease-gene products in the human protein-protein interaction network, remains a challenge (69).

Figure 5.2: Construction of the diseasome bipartite network. The figure is taken from (68).

5.3.1 Druggable proteins

Biological systems contain only four types of macromolecules with which small-molecule therapeutic agents can interfere: proteins, polysaccharides, lipids, and nucleic acids (108). Due to the inability to obtain potent compounds against the latter three macromolecule types, the majority of successful drugs achieve their activity by binding to, and modifying the activity of, a protein. This limits the molecular targets for which commercially viable compounds can be developed. Drug targets need to be able to bind compounds with appropriate properties, i.e., they need to be druggable. Thus, the druggable genome is the subset of the 30,000 genes in the human genome that code for proteins able to bind drug-like molecules (108). The pharmaceutical industry has historically relied upon these druggable proteins, against which chemists attempt to develop compounds with desired actions.

Most drugs act by binding to specific proteins. Since binding sites on proteins usually exist out of functional necessity, most successful drugs achieve their activity by competing with other small molecules for a binding site on a protein. This can cause severe changes in the biochemical and/or biophysical activities of proteins, with multiple consequences for various functions. Of the predicted 30,000 genes in the human genome, 3,051 code for a protein with some precedent for binding a drug-like molecule (108). However, the ability of a protein to bind a small molecule with the appropriate chemical properties at the required binding affinity might make it druggable, but does not necessarily make it a potential drug target; the protein also has to be linked to a disease. Recent estimates propose that there are from 3,000 to 10,000 disease-related genes (108). The potential drug targets that the pharmaceutical industry can exploit are then captured in the intersection between the druggable genome and the genes related to disease, as shown in Figure 5.3.

Figure 5.3: The estimated number of drug targets. The figure is taken from (108).

5.3.2 DrugBank

Historically, the disciplines of cheminformatics and bioinformatics have made little effort to integrate with one another. Only recently have there been some notable efforts to partially overcome this gap. The Therapeutic Target Database or TTD (109), KEGG (9), ChEBI (110), and PubChem are examples of such efforts. However, none of these databases provides a comprehensive molecular summary of any given drug or its corresponding protein target. Thus, DrugBank (111) has been introduced: a single, fully searchable in silico drug resource that links sequence, structure and mechanistic data about drug molecules with sequence, structure and mechanistic data about their drug targets. DrugBank is therefore a dual-purpose bioinformatics-cheminformatics database with a strong focus on quantitative, analytic or molecular-scale information about both drugs and drug targets. In many respects it combines data-rich molecular biology content with equally rich chemical data, bringing these two disparate types of information together into one unified and freely available resource. This allows educators and researchers from diverse disciplines and backgrounds (academic, industrial, clinical, non-clinical) to conduct the type of in silico learning and discovery that is now routine in the world of genomics and proteomics.

DrugBank currently contains more than 4,100 drug entries, corresponding to more than 12,000 different trade names and synonyms. DrugBank is divided into four major categories: FDA-approved small molecule drugs; FDA-approved biotech (protein/peptide) drugs; nutraceuticals or micronutrients such as vitamins and metabolites; and experimental drugs, including unapproved drugs, de-listed drugs, illicit drugs, enzyme inhibitors and potential toxins. DrugBank's coverage for non-trivial FDA-approved drugs is 80% complete. In addition, protein (i.e., drug target) sequences are linked to these drug entries.
The entire database, including text, sequence, structure and image data, is available from the DrugBank download webpage

and it occupies nearly 16 gigabytes of data, most of which can be freely downloaded. Moreover, DrugBank is a fully searchable web-enabled resource with many built-in tools and features for viewing, sorting and extracting drug or drug target data. In summary, DrugBank is a comprehensive, web-accessible database that brings together quantitative chemical, physical, pharmaceutical and biological data about thousands of well-studied drugs and drug targets. It is hoped that DrugBank will serve as a useful resource not only to members of the pharmaceutical research community but also to educators, students, clinicians and the general public.

5.3.3 Drug-target network

Proteins function as part of highly interconnected cellular networks rather than in isolation. For this reason, recent studies are focusing on understanding drug targets in the context of cellular and disease networks (69). Graph-theoretical tools have been combined with network biology and systematic information about drugs and their targets in order to analyze properties of drug targets from the perspective of cellular networks, to understand and describe relationships between drugs and their targets, and to quantify interrelationships between drug targets and the corresponding disease-gene products in PPI networks. Similarly to the diseasome described by Goh et al. (68), a bipartite graph was constructed, consisting of drugs as nodes in one partition and their targets (i.e., proteins) in the other, with an edge connecting a drug and a protein if the protein is a target of the drug. This bipartite graph was denoted the drug-target network (DT network) (69).
Two biologically relevant network projections were generated from the DT network: the drug network, in which nodes represented drugs and two drugs were connected if they shared at least one target protein, and the target-protein network (TP network), in which nodes were proteins and two proteins were connected if they were both targeted by at least one common drug.

By observing the visualized DT network (shown in Figure 5.4), it was clear that the network was clustered by major therapeutic classes, although it was created without any knowledge of drug classes. Moreover, the network displayed many connections between different drugs, indicating that the majority of drugs had at least one link to other drugs, i.e., that they shared targets with other drugs. The same held for targets in the TP network: the majority of targets were connected to other targets in the network. The highly interconnected TP network resulted from the existence of promiscuous drugs targeting multiple targets, suggesting that the drug industry was mainly focusing on already known targets when generating new drugs (69). This was true when only FDA-approved drugs and their targets were considered in constructing the DT network. When experimental drugs and their targets were considered as well, the size of the drug network did not increase significantly; however, the size of the TP network did, suggesting that experimental drugs targeted a more diverse set of target proteins.

In a similar manner to the diseasome (68), the TP network (69) was overlaid onto the human PPI network (31; 30), resulting in 262 drug-target proteins being present in the PPI network. These proteins had higher degrees than other proteins in the PPI network. However, the authors found that, although drug-target proteins had a high number of interacting partners, their degrees were not nearly as high as those of essential proteins, suggesting that drug targets did not show a trend towards greater essentiality (69). Additionally, the authors explored whether drugs, their corresponding target proteins, and disease-gene products in the PPI network might relate to each other at a high level of organization.
They did this by measuring the minimum shortest path distances between drug targets and disease-gene products implicated in at least one common disease. 922 such drug target-disease gene pairs were identified. The actual mechanism through which a drug acts could be unknown, but the shortest distance estimated the number of molecular steps between a drug target and the corresponding gene causing the disease. The results suggested that most drugs are palliative, targeting

proteins that are not the actual cause of the disease, but whose activity can be perturbed to counteract the symptoms of disease-causing proteins. On the other hand, drugs targeting, for example, cancer were found to be etiology-specific, directly targeting the actual cause of the disease.

Although many efforts have been made, we are still far from understanding disease, since studying disease in the context of molecular networks faces many technological, biological, and algorithmic challenges. Human network data remain sparse and their completion is of crucial importance. However, data collection and interpretation are complicated by the large size of the human proteome and its diversity of cells and tissues. Additionally, many important types of networks, such as networks of regulatory, synthetic-lethal, or chemical-genetic interactions, are still forthcoming. These data will need to be integrated with protein interaction, protein structure, and gene expression data into a single framework. Moreover, existing computational frameworks are ill-suited to cope with the ongoing explosion in network-level measurements and information. The development of new computational tools to organize, visualize and integrate these data will be a step towards understanding the complex biological mechanisms. Furthermore, valuable information about disease and protein interactions is buried within millions of biomedical records. Text mining approaches are therefore essential to recover such information. Nonetheless, elucidating the mechanisms of human disease remains a holy grail of bioinformatics (103; 106).

Future directions for better understanding of disease include the following. A systematic experimental genome-wide study of protein interactions between host and pathogen, which is not yet available in the literature, could provide insight into the mechanisms of pathogenicity of bacteria, viruses or parasites (103; 106).
Moreover, since network and protein structural approaches are complementary, the combination of network studies with a more detailed analysis of protein structures has the potential to be an excellent framework for the study of disease mechanisms and rational design of drugs (106).
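The shortest-path analysis of drug targets and disease-gene products described earlier in this section can be illustrated with a breadth-first search over a PPI network. The sketch below is an assumption-laden illustration, not the code used in (69): the helper names `build_adjacency` and `min_distance`, as well as the toy protein labels, are hypothetical, and the network is assumed to be undirected and unweighted.

```python
from collections import deque


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> neighbor-set dictionary."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def min_distance(adj, drug_targets, disease_genes):
    """Minimum number of edges between any drug target and any disease gene.

    Runs one breadth-first search started from all drug targets at once.
    Returns None if no disease gene is reachable.
    """
    goals = set(disease_genes)
    dist = {t: 0 for t in drug_targets if t in adj}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if node in goals:
            return dist[node]  # first goal dequeued is a closest one
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None


# Made-up PPI network: T1 - A - B - G1 and T2 - G2.
ppi = build_adjacency([("T1", "A"), ("A", "B"), ("B", "G1"), ("T2", "G2")])
d_far = min_distance(ppi, ["T1"], ["G1"])    # target several steps away
d_near = min_distance(ppi, ["T2"], ["G2"])   # target adjacent to the gene
d_same = min_distance(ppi, ["G1"], ["G1"])   # target is the disease gene
d_none = min_distance(ppi, ["T2"], ["G1"])   # different components
print(d_far, d_near, d_same, d_none)
```

Under the interpretation given in the text, a distance of zero or one would correspond to a drug acting at or next to the disease-causing protein, whereas larger distances are more consistent with a palliative mode of action.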

Figure 5.4: Drug-target network (DT network). The DT network is generated by using the known associations between FDA-approved drugs and their target proteins. Circles and rectangles correspond to drugs and target proteins, respectively. A link is placed between a drug node and a target node if the protein is a known target of that drug. The area of the drug (protein) node is proportional to the number of targets that the drug has (the number of drugs targeting the protein). Color codes are given in the legend. The figure is taken from (69).

Chapter 6

Network comparison

Just as comparative genomics has led to an explosion of knowledge about evolution, biology, and disease, so will comparative proteomics. As more biological network data become available, comparative analyses of these networks across species are proving to be valuable, since such systems-biology comparisons may lead to exciting discoveries in evolutionary biology. For example, comparing networks of different organisms might provide deeper insights into the conservation of proteins, their function, and protein-protein interactions through evolution. Conceptually, network comparison is the process of contrasting two or more interaction networks, representing different species, conditions, interaction types, or time points, with the aim of answering fundamental biological questions.

6.1 Types of network comparison methods

Three different types of comparative methods exist (70). The most common methods for such network comparisons are network alignments. An alignment is achieved by constructing a mapping between the nodes of the networks being compared, as well as between the corresponding interactions. In this process, topologically and functionally similar regions of biological networks are discovered. Depending on the properties of mappings,

network alignment can be local or global. Most of the research in previous years has focused on local alignments. With local network alignment algorithms, optimal mappings are chosen independently for each local region of similarity. With global network alignments, one optimal mapping for the entire network is constructed, even though this may imply less perfect alignments in some local regions. Additionally, network alignment can be pairwise or across multiple species.

The second type of method is network integration, the process of combining several networks that encompass interactions of different types over the same set of elements in order to study their interrelations. Because each type of network lends insight into a different slice of biological information, integrating different network types may provide a more comprehensive picture of the overall biological system under study (70). The main conceptual difference from network alignment is that the networks to be integrated are defined over the same set of elements (e.g., the set of proteins of a certain species), and the integration is achieved by merging them into a single network with multiple types of interactions, each drawn from one of the original networks. A fundamental problem is to identify, in the merged network, functional modules that are supported by interactions of multiple types. Thus, network integration can assist in predicting protein interactions and uncovering protein modules that are supported by interactions of different types (56). For example, Kelley et al. (112) studied the interrelations between protein-protein and genetic (synthetic lethal) interactions in yeast. They searched for two structures in the integrated network: pairs of subnetworks of protein-protein interactions interconnected by a dense pattern of genetic interactions, and clusters enriched for both physical and genetic interactions.
The first structure was found to be more prevalent, suggesting that genetic interactions tended to bridge genes operating in two pathways with redundant or complementary functions, rather than occurring between protein subunits within a single pathway.

The final mode of comparison is network querying, in which a given network is searched for subnetworks that are similar to a subnetwork query of interest. Network alignment and integration focus on de novo discovery of biologically significant regions embedded in a network, based on the assumption that regions supported by multiple networks are functional. In contrast, network querying searches for a subnetwork that is previously known to be functional; the goal is to identify subnetworks in a given network that are similar to the query. However, network querying tools are still at an early stage and are currently limited to sparse topologies, such as paths and trees. Approaches to handle more general queries could benefit from the rich literature on graph mining techniques in the data mining community (70).

One of the major challenges in performing network comparison is the noisiness of the network data. Systematic screens for protein interactions report large numbers of false-positive measurements, which raises the question of which interactions represent true binding events. Confidence measures on interactions can and should be taken into account before network comparison; however, these are not always available.

6.2 Algorithms for network alignment

A variety of network comparison methods exist. Here, we focus mainly on methods for network alignment, and mostly on those that have been applied to PPI networks. Because network alignment involves the underlying subgraph isomorphism problem, which is NP-complete, exact alignment of large biological networks is computationally infeasible. In certain cases, for example when the two networks being compared represent linear chains of interactions, the network alignment problem admits efficient algorithmic solutions. In general, however, the problem is computationally hard, and heuristic approaches have been sought (70).

Conceptually, to perform network alignment, a merged representation of the networks being compared, called a network alignment graph, is created. In a network alignment graph, the nodes represent sets of similar molecules, one from each network, and the links represent conserved molecular interactions across the different networks. An illustration is given in Figure 6.1. The alignment is particularly simple when there exists a one-to-one correspondence between molecules across the networks (global alignment), but in general there may be a complex many-to-many correspondence (local alignment). A network alignment graph facilitates the search for conserved network regions, as these will appear as subnetworks with a specific structure in the network alignment graph. For instance, conserved protein complexes might appear as clusters of densely interacting nodes.

Thus, there are two core challenges in network alignment. First, a scoring framework that captures the similarities between nodes originating in different networks must be defined. Second, a way to rapidly identify high-scoring alignments (i.e., conserved functional modules) from among the exponentially large set of possible alignments needs to be specified. Because of the computational complexity, heuristic search strategies, typically greedy algorithms, are used for this purpose. Methods for network alignment differ in how they address these two challenges, that is, in how they define the similarity scores between protein pairs and in which greedy algorithm they use to identify conserved subnetworks.

The problem of network alignment has been approached in different ways and a variety of algorithms have been developed. Whereas the majority of algorithms focus on pairwise network alignments, newer approaches have tried to address the problem of aligning networks belonging to multiple organisms. Additionally, alongside local alignment methods, algorithms for global network alignment have emerged.
Finally, whereas previous studies have focused on network alignment based solely on biological functional information such as protein sequence similarity, recent studies have been combining the functional information with network topological information (70).
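The construction of a network alignment graph for two networks can be sketched as follows. This is a deliberately simplified illustration: it links two aligned protein pairs only when both underlying interactions are directly present in both networks (no gaps or mismatches), and it takes the list of similar protein pairs as given. The function name `alignment_graph` and all protein labels are hypothetical.

```python
from itertools import combinations


def alignment_graph(edges1, edges2, similar_pairs):
    """Build a two-network alignment graph.

    Nodes are (protein_in_net1, protein_in_net2) pairs judged similar.
    Two nodes are linked if the interaction is conserved, i.e. present
    between the corresponding proteins in both networks.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes = sorted(set(similar_pairs))
    links = set()
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if frozenset((u1, u2)) in e1 and frozenset((v1, v2)) in e2:
            links.add(((u1, v1), (u2, v2)))
    return nodes, links


# Made-up example: network 1 has a-b and b-c; network 2 has only x-y.
net1 = [("a", "b"), ("b", "c")]
net2 = [("x", "y")]
similar = [("a", "x"), ("b", "y"), ("c", "z")]
nodes, links = alignment_graph(net1, net2, similar)
print(links)
```

Only the interaction a-b has a counterpart (x-y) in the second network, so the alignment graph contains a single link, between the nodes (a, x) and (b, y). Conserved paths or dense clusters would then be searched for within this graph.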

Figure 6.1: Network alignment graph. Each node in this aligned network represents a set of similar proteins (one from each species) and each link represents a conserved interaction. Other than species, the networks being compared can also be sampled across different biological conditions or interaction types. The figure is taken from (70).

6.2.1 Pairwise alignment of PPI networks

In the simplest case, the similarity of a protein pair, where one protein originates from each of the networks being aligned, is determined solely by sequence similarity. Then, the top-scoring protein pairs are aligned between the two networks. The top-scoring protein pairs are typically found by applying BLAST (113) to align all possible pairs of protein sequences from the two networks, and the pairs with the lowest E-values are chosen. The simplest network alignment then identifies pairs of interactions in PPI networks, called interologs, involving two proteins in one species and their best sequence matches in another species (114). However, beyond the simple identification of conserved protein interactions, it is possible to identify network subgraphs that might be conserved between two protein networks. An algorithm called PathBLAST (115) has been applied to find high-scoring paths in the alignment graph, by combining network topology and protein sequence

similarity information. The algorithm works as follows. PathBLAST searches for high-scoring pathway alignments involving two paths, one from each network, in which proteins of the first path are paired with putative homologs occurring in the same order in the second path. Pathway alignments are scored by the degree of protein sequence similarity at each pathway position and by the quality of the protein interactions they contain. That is, the likelihood of a pathway match is computed by taking into account both the probabilities of true homology between proteins on the path and the probabilities that the protein-protein interactions present in the path are real, i.e., not false-positive errors. The score of a path is thus a product of independent probabilities for each aligned protein pair and for each protein interaction. The probability of a protein pair is based on the BLAST E-value of aligning the sequences of the corresponding proteins, whereas the probability of a protein interaction is based on the false-positive rates associated with interactions in the target network. PathBLAST implements an efficient search through all possible alignments between two networks to identify the highest-scoring pathway alignments overall. The search space can be reduced by specifying an E-value threshold, so that all pairs of aligned proteins with a BLAST E-value above the threshold are discarded. Evolutionary variations and experimental errors in pathway structure are accommodated by allowing gaps and mismatches in the algorithm. A gap occurs when interacting proteins in one path are aligned against orthologous proteins in the other path that do not interact directly but via a common protein. A mismatch occurs when aligned proteins do not share sequence similarity.
The PathBLAST method has been extended to detect conserved protein clusters rather than paths, by deploying a likelihood-based scoring scheme that weighs the denseness of a given subnetwork versus the chance of observing such network substructure at random (116). PathBLAST was used to identify five regions that were conserved across the protein networks of Saccharomyces cerevisiae and Helicobacter pylori. PathBLAST was also used to show that the protein-protein interaction network of Plasmodium falciparum differed substantially from those of other eukaryotes (117).
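The product-of-probabilities scoring of a pathway alignment described above for PathBLAST can be illustrated numerically. The sketch below is not the PathBLAST implementation: it simply assumes that a homology probability for each aligned protein pair and a reliability probability for each interaction are already available as numbers in (0, 1], and it sums their logarithms, which is equivalent to taking the product but numerically more stable for long paths. The function name `path_log_score` and all probability values are made up.

```python
import math


def path_log_score(pair_probs, interaction_probs):
    """Log of the product of all pair and interaction probabilities.

    Assumes every probability lies in (0, 1]; math.log raises ValueError
    for a probability of exactly zero.
    """
    return (sum(math.log(p) for p in pair_probs)
            + sum(math.log(q) for q in interaction_probs))


# Two hypothetical candidate pathway alignments of three aligned pairs each.
strong = path_log_score([0.9, 0.8, 0.5], [0.7, 0.6])
weak = path_log_score([0.9, 0.3, 0.5], [0.7, 0.2])

print("strong:", math.exp(strong))
print("weak:  ", math.exp(weak))
print("preferred:", "strong" if strong > weak else "weak")
```

Because the logarithm is monotonic, ranking candidate alignments by the summed log-probabilities gives the same order as ranking them by the product itself.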

MaWISh (118) is a method for pairwise local alignment of PPI networks that implements an evolution-based scoring scheme to detect conserved protein clusters. This mathematical model extends the concepts of evolutionary events in sequence alignment to those of duplication, match, and mismatch in network alignment. The method evaluates the similarity between graph structures through a scoring function that accounts for these evolutionary events by weighting edges in order to reward or penalize each event. Each duplication is associated with a score that reflects the divergence of function between the two proteins. The score is based on protein sequence similarity and is computed by BLAST. A match corresponds to a conserved interaction between two orthologous protein pairs; thus, a match score reflects the confidence that both protein pairs are orthologous. A mismatch, on the other hand, is the lack of an interaction in the PPI network of one organism between a pair of proteins whose orthologs interact in the other organism. A mismatch may correspond to the emergence of a new interaction or the elimination of a previously existing interaction in one of the species after the split, or to an experimental error. After each match, mismatch, and duplication is given a score, the optimal alignment is defined as a set of nodes with the maximum score, computed by summing the scores of all matches, mismatches, and duplications in the given set of nodes.

The MaWISh algorithm was applied to detect conserved subnetworks in pairwise alignments of the PPI networks of yeast, fly, and worm. In total, 412 common subnetworks were identified between yeast and fly, 83 between yeast and worm, and 146 between worm and fly.
While most of the conserved subnets were dominated by one particular biological process and the dominant processes were generally consistent across species, there also existed different processes in different organisms that were mapped to each other by the discovered alignments. This illustrates that the comparative analysis of PPI networks is effective not only in identifying particular functional

modules, pathways, and complexes, but also in discovering relationships between different processes in separate organisms and uncovering the crosstalk that exists between known functional modules and pathways (118).

Unlike MaWISh, ISORANK (119) is a method for pairwise global alignment of PPI networks, similar in spirit to Google's PageRank. ISORANK maximizes the overall match between the two networks by using both biological (i.e., BLAST-computed protein sequence similarity) and topological (i.e., protein interaction) information, with the contribution of each being a user-adjustable parameter. The algorithm considers weights of interactions, if available. Given two networks, the goal of the algorithm is to find the maximum common subgraph between the two graphs, i.e., the largest graph that is isomorphic to subgraphs of both networks. Additionally, the algorithm outputs the corresponding node mapping such that each node is mapped to at most one node in the other network, with nodes not mapped to any other node being gaps.

The algorithm works in two stages. It first associates a score with each possible match between nodes of the two networks. Given network and sequence data, an eigenvalue problem is constructed and solved to compute these scores. The scores are computed using the intuition that two nodes, one from each network, are a good match if their respective neighbors also match well with each other. The method captures not only the local topology of nodes, but also non-local influences on the score of a protein pair: the score of the protein pair depends on the scores of the neighbors of the two nodes, and the latter, in turn, depend on the neighbors of their neighbors, and so on. The incorporation of other information, e.g., BLAST scores, into this model is straightforward. The total score of a protein pair is then simply the weighted sum of the BLAST and topological scores; the two weights need to sum to 1.
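The scoring stage described above, together with the greedy extraction of a one-to-one mapping that the text turns to next, can be sketched in plain Python. This is an illustration under several assumptions and not the ISORANK implementation: the eigenvalue problem is approximated by power iteration; each neighbor pair is assumed to spread its score evenly over its own neighbor pairs (division by the product of the two degrees); the networks are unweighted with symmetric adjacency dictionaries; and `alpha` is assumed to be below 1 so that the normalization never divides by zero. All names and the toy data are hypothetical.

```python
def isorank_scores(adj1, adj2, seq_sim, alpha=0.6, iters=50):
    """Approximate pairwise match scores by power iteration.

    adj1, adj2: node -> set of neighbors (undirected, symmetric).
    seq_sim: (node1, node2) -> non-negative sequence similarity.
    alpha: weight of topology; (1 - alpha) is the weight of sequence.
    """
    pairs = [(i, j) for i in sorted(adj1) for j in sorted(adj2)]
    total = sum(seq_sim.get(p, 0.0) for p in pairs)
    # Sequence term, normalized to sum to 1 (uniform if no data given).
    e = {p: seq_sim.get(p, 0.0) / total if total else 1.0 / len(pairs)
         for p in pairs}
    r = {p: 1.0 / len(pairs) for p in pairs}
    for _ in range(iters):
        new = {}
        for i, j in pairs:
            # A pair scores well if its neighbor pairs score well.
            topo = sum(r[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                       for u in adj1[i] for v in adj2[j])
            new[(i, j)] = alpha * topo + (1 - alpha) * e[(i, j)]
        norm = sum(new.values())
        r = {p: s / norm for p, s in new.items()}
    return r


def greedy_match(scores):
    """Repeatedly take the best remaining pair whose nodes are both unused."""
    used1, used2, mapping = set(), set(), {}
    for (i, j), _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if i not in used1 and j not in used2:
            mapping[i] = j
            used1.add(i)
            used2.add(j)
    return mapping


# Made-up example: two three-node paths, with one sequence hint (a ~ x).
net1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
net2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
scores = isorank_scores(net1, net2, {("a", "x"): 1.0})
mapping = greedy_match(scores)
print(mapping)
```

In this toy case the two central nodes are matched on topology alone, the sequence hint anchors one end of the path, and the remaining ends are paired by elimination.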
The second stage constructs the mapping by extracting from all protein pairs the high-scoring matches. An appealing approach is to construct the bipartite graph with each side containing

all the nodes from one network and two nodes originating in opposite partitions being connected with an edge that represents the score of aligning the two proteins. The optimal alignment solution can then be found efficiently by finding the maximum-weight bipartite matching for this graph. However, it was found that the repetitive greedy strategy of identifying and outputting the highest-scoring pair and then removing all scores involving either of the two identified nodes was more efficient.

The algorithm was applied to find a global alignment of yeast and fly PPI networks and to identify the (disconnected) conserved subgraph between them, consisting of 1,420 common edges. The largest connected component in the alignment had 35 interactions amongst 35 proteins. The contributions of topological and sequence information to produce this alignment were 0.6 and 0.4, respectively. Based on their alignment, and starting from the premise that proteins that are aligned together in the global alignment should have similar interaction patterns in their respective species and are thus likely to be functional orthologs, the authors predicted functions for unannotated proteins in one species based on the function of the annotated protein in the other species. Predictions produced by this method were consistent with those in previous studies.

6.2.2 Multiple PPI network alignment

The major problem in network alignment, even when only two networks are being aligned, is computational complexity. This issue grows into a problem of computational scalability when a large number of interaction networks are compared. The problem becomes even more serious as interaction data for more organisms become available. Despite these difficulties, strategies for alignment of networks belonging to multiple species have been proposed.

PathBLAST (115) was extended into a computational framework for alignment and comparison of more than two protein networks (120). In a manner similar to PathBLAST, this process integrates interactions with sequence information to generate a network alignment graph. Each node in the graph consists of a group of sequence-similar proteins, one from each species, and each link between a pair of nodes represents conserved protein interactions between the corresponding protein groups. Two types of conserved subnetwork structures were searched for: short linear paths of interacting proteins, which model signal transduction pathways, and dense clusters of interactions, which model protein complexes. The search is guided by reliability estimates for each protein interaction, which are combined into a probabilistic model for scoring candidate subnetworks. Under the model, a log-likelihood ratio score is used to compare the fit of a subnetwork to the desired structure (path or cluster) versus its likelihood given that each species' interaction map was randomly constructed. The underlying model assumptions are that in a real subnetwork, each interaction should be present independently with high probability, and that in a random subnetwork, the probability of an interaction between any two proteins depends on their total number of connections in the network. The search algorithm exhaustively identifies high-scoring subnetwork seeds and expands them in a greedy fashion. The significance of the identified subnetworks is evaluated by comparing their scores to those obtained on randomized data sets, in which each of the interaction networks is shuffled along with the protein similarity relationships between them (120). This PathBLAST-like multiple network alignment strategy was applied to compare the PPI networks of yeast, fly, and worm, and to systematically identify conserved protein subnetworks across these species.
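The log-likelihood ratio idea described above can be sketched for a single species and a candidate cluster as follows. This is a simplified illustration under stated assumptions, not the scoring function of (120): it ignores the per-interaction reliability estimates and the combination across species, the parameter `beta` (the assumed probability that two proteins of a true cluster interact) is hypothetical, and the random-model probability is approximated here by a degree-product form, which is only one possible way to make it depend on the proteins' numbers of connections.

```python
import math
from itertools import combinations


def cluster_log_likelihood_ratio(nodes, edges, degree, total_edges, beta=0.9):
    """Log-likelihood ratio of 'dense cluster' versus 'random network'.

    nodes       -- proteins in the candidate cluster
    edges       -- set of frozenset({u, v}) interactions observed in the network
    degree      -- dict: protein -> number of interactions in the whole network
    total_edges -- number of interactions in the whole network
    beta        -- assumed interaction probability inside a true cluster
    """
    score = 0.0
    for u, v in combinations(sorted(nodes), 2):
        # Assumed random model: probability grows with the product of degrees.
        p_rand = degree[u] * degree[v] / (2.0 * total_edges)
        p_rand = min(max(p_rand, 1e-9), 1.0 - 1e-9)  # keep the logs finite
        if frozenset((u, v)) in edges:
            score += math.log(beta / p_rand)
        else:
            score += math.log((1.0 - beta) / (1.0 - p_rand))
    return score


# Hypothetical example: three proteins, each of degree 2, in a 10-edge network.
degree = {"a": 2, "b": 2, "c": 2}
triangle = {frozenset(("a", "b")), frozenset(("b", "c")), frozenset(("a", "c"))}
dense_score = cluster_log_likelihood_ratio(["a", "b", "c"], triangle, degree, 10)
sparse_score = cluster_log_likelihood_ratio(["a", "b", "c"], set(), degree, 10)
```

A fully connected candidate receives a positive score because its interactions are far more likely under the cluster model than under the random model, while a candidate with no interactions receives a negative score.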
A total of 71 conserved network regions that fell into well-defined functional categories were discovered (120). Two representative alignments are shown in Figure 6.2. To validate the results, the authors compared the clusters discovered by their algorithm to known complexes in yeast as annotated by MIPS (23).

Of the clusters, 94% had at least half of their annotated proteins sharing the same annotation. Thus, although any single network contains false-positive interactions, embedded beneath this noise are protein interaction complexes and pathways conserved across all three species. Additionally, starting from the premise that a conserved subnetwork that contains many proteins of the same known function suggests that the remaining proteins also have that function, the authors predicted thousands of new protein functions for the three organisms, with an estimated success rate of 58-63% (120). Whenever the set of proteins in a conserved cluster or path identified over all three species was significantly enriched for a particular Gene Ontology (GO) annotation (102) and at least half of the annotated proteins in the cluster or path had that annotation, all remaining proteins in the subnetwork were predicted to have the enriched GO annotation.

Figure 6.2: Two representative alignments of conserved protein subnetworks across yeast, worm and fly. The figure is taken from (120).

Graemlin 3 (121) is a method for multiple network alignment that overcomes the major drawbacks of all previous algorithms of this type. It is fast and scalable. It is the first program capable of multiple alignment of an arbitrary number of networks, 3