Supplementary Figure 1: Procedure to assign nodes to different connectivity categories

[Figure: ranked list of protein degrees in decreasing order, running from proteins with few connections (0) to proteins with many connections (100). Step 1: draw a random number between 80 and 98 (value1). Step 2: draw a random number between 60 and value1 (value2). Step 3: assign labels to the proteins in each interval of the ranked list: low-degree proteins below value2, middle-degree proteins between value2 and value1, high-degree proteins above value1.]

We defined three categories to classify the proteins in a network according to their connectivity: low-, middle- and high-degree. Instead of requiring an absolute number of interactions for a protein to belong to a category, we required the protein to fall within a certain percentage range of the data.

We create a ranked list of proteins according to their number of connections. This list has values in the interval [0, 100]. High values represent the most connected proteins of the network. The first step is to randomly select a value in the interval [80, 98]. We call it value1. The proteins in the interval [value1, 100] are considered high-degree proteins. They are the top 2-20% most connected proteins of the network. The second step is to randomly select another value in the interval [60, value1]. We call it value2. The proteins in the interval [value2, value1] are the middle-degree proteins. They occupy the middle 20-38% of the ranked list. Finally, after these two values are selected and the respective high- and middle-degree categories created, we define that the low-degree proteins occupy the bottom 60% of the list.

When performing pair-wise comparisons, we used the same values of value1 and value2 for both databases. In addition, this procedure was repeated 100 times and the number of agreements (proteins that changed or did not change between categories) was recorded. The result can be seen in Figure 1, where we show the average and standard deviation of the 100 repetitions of this procedure.
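
The procedure above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function names, the uniform draw of value1 and value2, the tie-breaking by input order, and the treatment of everything below value2 as low-degree (as drawn in the figure) are all assumptions made for this example.

```python
import random
import statistics


def percentile_ranks(degrees):
    """Map each protein to a rank in [0, 100]; 100 is the most connected.

    Assumption: ties in degree are broken by input order.
    """
    ordered = sorted(degrees, key=lambda p: degrees[p])  # ascending degree
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 100.0}
    return {p: 100.0 * i / (n - 1) for i, p in enumerate(ordered)}


def draw_thresholds(rng):
    """Draw value1 in [80, 98], then value2 in [60, value1] (uniform: assumed)."""
    value1 = rng.uniform(80, 98)
    value2 = rng.uniform(60, value1)
    return value1, value2


def assign_categories(degrees, value1, value2):
    """Label each protein 'high', 'middle' or 'low' from its rank."""
    labels = {}
    for protein, rank in percentile_ranks(degrees).items():
        if rank >= value1:
            labels[protein] = "high"
        elif rank >= value2:
            labels[protein] = "middle"
        else:
            labels[protein] = "low"  # assumption: everything below value2
    return labels


def count_agreements(degrees_a, degrees_b, value1, value2):
    """Count shared proteins that get the same label in both databases."""
    labels_a = assign_categories(degrees_a, value1, value2)
    labels_b = assign_categories(degrees_b, value1, value2)
    shared = labels_a.keys() & labels_b.keys()
    return sum(labels_a[p] == labels_b[p] for p in shared)


def repeat_comparison(degrees_a, degrees_b, repetitions=100, seed=0):
    """Repeat the comparison with fresh thresholds; return (mean, stdev)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(repetitions):
        value1, value2 = draw_thresholds(rng)  # same values for both databases
        counts.append(count_agreements(degrees_a, degrees_b, value1, value2))
    return statistics.mean(counts), statistics.stdev(counts)
```

Here `degrees_a` and `degrees_b` are hypothetical dictionaries mapping a protein identifier to its number of interaction partners in each database. Each database is ranked on its own full protein list, which is also an assumption of this sketch.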

Supplementary Figure 2: Average number of interactions and betweenness distribution of all evaluated databases

(A) Shown is the distribution of the average number of interaction partners in the databases analyzed. (B) Shown is the betweenness (i.e., centrality) of proteins in a network. The degree and betweenness distributions are similar among all databases.

Supplementary Figure 3: Absolute numbers of exclusive and shared interaction partners in pairwise comparisons of PPI datasets

[Figure: bar charts; y-axis: Number of Interactions; bars: pairwise combinations of MINT, INTACT, BIOGRID, HPRD and HIPPIE, each with its Shared Interactions.]

We first identified proteins shared between two databases. For these proteins, we identified their interaction partners in each of the databases, and then compared the interaction partners. For all pair-wise comparisons, yellow and blue represent the indicated databases; Shared (red) denotes the number of interaction partners shared between databases.
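
The pairwise comparison described in this caption can be sketched with plain set operations. This is an illustrative sketch only: the function names and the input format (a list of protein-identifier pairs per database) are assumptions. Partners are counted per shared protein, so an interaction whose two proteins are both shared contributes once from each end.

```python
def partner_map(interactions):
    """Build protein -> set of interaction partners from (a, b) pairs."""
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return partners


def compare_partners(interactions_a, interactions_b):
    """Count exclusive and shared partners for proteins in both databases."""
    partners_a = partner_map(interactions_a)
    partners_b = partner_map(interactions_b)
    shared_proteins = partners_a.keys() & partners_b.keys()

    only_a = only_b = shared = 0
    for protein in shared_proteins:
        only_a += len(partners_a[protein] - partners_b[protein])
        only_b += len(partners_b[protein] - partners_a[protein])
        shared += len(partners_a[protein] & partners_b[protein])
    return {"only_a": only_a, "only_b": only_b, "shared": shared}
```

For example, with the hypothetical inputs `[("P1", "P2"), ("P1", "P3")]` and `[("P1", "P2"), ("P1", "P4")]`, the shared proteins are P1 and P2. P1 has one partner exclusive to each database and one shared partner, and P2 has one shared partner.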

Supplementary Figure 4: Percentage of organ/cell type-specific genes represented in the six PPI databases

[Figure: axis: Organ/Cell-type Specific genes [%].]

For genes expressed in 84 different human organs/cell types, we determined the level of coverage for the proteins in each of the databases analyzed. All databases have a relatively even coverage of all organs and cell types, although the number of genes expressed varies significantly between the different organs/cell types (Supplementary Figure 5).

Supplementary Figure 5: Numbers of genes expressed in 84 different human organs or cell types

Microarray data were obtained from different human tissues in duplicate and averaged. These data were pre-processed with GCRMA (GeneChip Robust Multiarray Averaging). Shown are the numbers of genes with moderate to high transcription levels (see Methods).

Supplementary Figure 6: Organ/cell type-specific subnetworks

[Figure: axis: Remaining Interactions [%].]

By combining PPI datasets with data of organ/cell type-specific gene expression, we generated subnetworks that are limited to interactions with both partners expressed in the same organ or cell type; also included are PPIs among house-keeping proteins expressed in all cell types. As expected, this analysis reduced the number of interactions significantly, as compared to the parent PPI database. On average, the organ/cell type-specific subnetworks encompass 1-25% of the original interactions. Although the number of interactions is reduced significantly compared to the parent database, the subnetworks still have several thousand interactions. Depicted is the percentage of remaining edges from the original database filtered by organ/cell-type specific genes.
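
The filtering step described here can be illustrated with a short sketch. The function names and data layout are assumptions for the example: `interactions` is a list of protein pairs from the parent database, and `expressed` is the set of proteins expressed in one organ or cell type. House-keeping proteins are expressed in all cell types, so they are assumed to be members of every `expressed` set.

```python
def tissue_subnetwork(interactions, expressed):
    """Keep only interactions whose two partners are both expressed."""
    return [(a, b) for a, b in interactions if a in expressed and b in expressed]


def remaining_percentage(interactions, expressed):
    """Percentage of parent-database interactions kept in the subnetwork."""
    if not interactions:
        return 0.0
    kept = tissue_subnetwork(interactions, expressed)
    return 100.0 * len(kept) / len(interactions)
```

Applied to every organ or cell type in turn, `remaining_percentage` would give the kind of value plotted in this figure.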

Supplementary Figure 7: Number of connected components according to organ/cell-type subnetwork

[Figure: axis: Number of Components.]

After filtering the parent PPI network to contain only interactions of proteins expressed in the same organ/cell type, the subnetwork is fragmented into several connected components. We considered a connected component to be a connected group of at least two proteins (i.e., the singletons in the network are ignored). We observe that database is the least fragmented among all databases compared and that certain tissue subnetworks (liver, heart, lung) have more connected components than others (ovary, olfactory lobe, skin).
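
Counting connected components while ignoring singletons can be sketched with a breadth-first search over the filtered edge list. This is a minimal sketch under assumed names and input format (a list of protein pairs), not the tool used for the figure. A protein whose only interaction is with itself forms a component of size one and is therefore not counted.

```python
from collections import defaultdict, deque


def count_components(interactions, min_size=2):
    """Count connected components containing at least min_size proteins."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    count = 0
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search to collect the component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if size >= min_size:
            count += 1
    return count
```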

Supplementary Figure 8: Relation between the number of organ/cell type-specific genes and remaining interactions

In tissues with a high number of expressed genes, more interactions from the original database remained. Consequently, we observed a strong correlation between the number of organ/cell type-specific genes and the remaining interactions per subnetwork. In all cases, the Pearson's correlation coefficient between the number of genes and the number of interactions in the subnetworks was greater than 0.96 (p-value < 2.2e-16), indicating that the number of interactions seems to increase proportionally to the number of nodes.
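
For reference, the Pearson correlation coefficient used here can be computed as in the sketch below. The function name and the idea of passing one value per subnetwork (number of expressed genes, number of remaining interactions) are assumptions for illustration. The sketch does not compute the p-value.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)
```

Here `xs` would hold the number of genes per organ or cell type and `ys` the number of interactions remaining in the matching subnetwork.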
