Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer 1
Content What is Community Detection? Motivation Defining a community Methods to find communities Overlapping communities Clique percolation method Finding a community with query nodes Conclusion 2
What is Community Detection? Different from traditional clustering Algorithms use the graph property Graphs with a natural origin have a structure that is not random We try to find these structures by analyzing the graph A perfect solution has yet to be found 3
Motivation Communities can represent parts of a larger system (Like organs in the human body) Communities can be considered as a summary of the graph Communities make it easy to visualize and understand complex systems Communities on the web might represent pages of related topics Community can reveal the properties without releasing the individual privacy information 4
Defining a Community There is not exact definition of a community in a graph It depends on the application A general definition: Separation between nodes in different communities Cohesion between nodes in a community The differences between algorithms come down to the precise definition 5
Basics For a Graph G = {V, E} and a subgraph C G with G = V = n and C = nc φint(c) should have a higher value than the whole graph and φext(c) should be much lower Local definitions see communities as an autonomous entity within a larger system Global definitions see the communities as essential parts of a larger system Vertex similarity: compare individual nodes and group them based on a similarity measure 6
Methods Finding overlapping communities Clique percolation method (CPM) Finding communities with query nodes 7
Clique Percolation Method CPM is based on the idea that communities are likely to consist of cliques Assumption: Every node in the same community is connected to nearly every other node A community is build up by a chain of k-cliques which are adjacent. Two k-cliques are adjacent if they share k-1 nodes The largest possible chain is defined as community This is a local definition 8
Implementation of CPM The number of possible k-cliques in a graph is quite high Implementations search for maximal k-cliques (NP-hard problem) We build an clique-clique overlap matrix O All entries smaller than k-1 are removed 9
Parameter k = 3; k = 4 The results of processing the example graph with the CFinder software 10
Drawbacks Even if the underlying problem is NP-hard, for large sparse graphs, this algorithm is reasonably fast Some cases lead to useless results: It looks for cliques not dense subgraphs It requires a large number of cliques, but not too many 11
Finding a community with query nodes The goal is to find a subgraph H that contains a given set Q of query nodes and is densely connected. The function f is maximized among all possible choices for H In this case we choose the minimum degree for f Additionally we add a distance constraint d 12
Without size restriction - Greedy algorithm Choose f = f(h) = minimum degree of a node in H We set G0=G then repeat the steps: Obtain Gt+1 by removing a node which violates the distance constraint or has the minimum degree Terminate if either one of the query nodes has minimum degree or the query nodes are no longer connected We choose the component of Gt for which the minimum degree f(h) is maximized This can be implemented in O(n+m) 13
Q = {1, 2, 3} The greedy algorithm, without size constraint, applied on the example graph 14
Communities with size restriction A size constraint k makes the problem NP hard (Can be shown via a reduction to the Steiner tree problem) But it can be assumed that the size of the result set is correlated with the distance constraint The paper proposes two heuristics: GreedyDist repeatedly executes Greedy and decreases d until the size k of the graph is small enogh GreedyFast restricts the graph to the k closest nodes to the query nodes. Then Greedy is invoked 15
Evaluation with the DBLP dataset The goal was to find a network of scientific collaboration around Christos Papadimitriou 16
Conclusion A really broad topic with lots of applications Each algorithms is build with different problems in mind Algorithms are difficult to compare, there is no standard way of testing 17
Bibliography [1] P. Erdos and A. Renyi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5:17 61, 1960. [2] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75 174, 2010. [3] P. F. Jonsson and P. A. Bates*. Global topological features of cancer proteins in the human interactome. Bioinformatics, 2291 2297, 2006. [4] T. H. J. S. J.-P. O. K. Kaski. Spectral and network methods in the analysis of correlation matrices of stock returns. Physica A 383, 147 151, 2007. [5] J. M. Kumpula, M. Kivelä, K. Kaski, and J. Saramäki. Sequential algorithm for fast clique percolation. Phys. Rev. E, 78:026109, Aug 2008. [6] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping com- munity structure of complex networks in nature and society. Nature, 435:814 818, June 2005. [7] M. E. Porter, K. Schwab, M. E. Porter, K. Schwab, F. Paua, E. T. Herrera, and M. Porter. Communities in networks. Notices of the American Mathematical Society, 1164 1166, 2009. [8] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD interna- tional conference on Knowledge discovery and data mining, KDD '10, 939 948, New York, NY, USA, 2010. ACM. [9] K.-F. W. Wei Gao. Information Retrieval Technology. Springer Berlin Heidelberg, 2008. 18