Crawling and Detecting Community Structure in Online Social Networks using Local Information
TU Delft - Network Architectures and Services (NAS)
Outline
In order to find communities in a graph, one needs the full graph.
Crawling large datasets such as Online Social Networks takes a very long time.
  Facebook: 901 million users (active, April 2012); Twitter: over 140 million users (active, March 2012)
  Ideal crawling with one PC at 1 s per request: Facebook 29 years, Twitter 4.5 years
1. Crawling
  BFS/DFS/RFS
  Mutual Friend Crawling (MFC)
  The Reference Score
  Performance
2. Community Detection
  The Reference Score
  Compared to well-known methods
3. Conclusion
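The crawl-time estimate on the outline slide can be checked with a few lines of arithmetic. The sketch below assumes one sequential request per user and a year of 365.25 days; the function name `crawl_years` is only illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def crawl_years(num_users, seconds_per_request=1.0):
    """Years needed to request every user once, one request after another."""
    return num_users * seconds_per_request / SECONDS_PER_YEAR


facebook_years = crawl_years(901_000_000)  # active users, April 2012
twitter_years = crawl_years(140_000_000)   # active users, March 2012

print(f"Facebook: {facebook_years:.1f} years")
print(f"Twitter: {twitter_years:.1f} years")
```

Under these assumptions the results land close to the rounded 29 and 4.5 years quoted on the slide.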
Crawling Online Social Networks via Breadth/Depth First Search
[Figure: standard Breadth First Search and standard Depth First Search on example trees]
Unfortunately, social networks are not tree-like.
What most people do: Random First Search (RFS).
Using BFS, DFS or RFS leads to a sampling bias, and one has to wait until the full graph is crawled before communities can be detected.
Crawling Online Social Networks via Mutual Friend Crawling
Our proposed method, Mutual Friend Crawling (MFC), overcomes this situation by crawling a graph community-wise from any given seed node.
MFC is based on BFS/DFS plus one assumption: the degree of neighboring nodes is known. It keeps a Reference Score S_R for each node.
In the search trajectory, the next node to visit is the one with the highest S_R.
Crawling Online Social Networks via Mutual Friend Crawling
Example: starting with node 2, its neighbors are 0, 1, 3 and 4 with degrees [...]. Let's take 4 [...]. The Reference Scores are: 0: 0.2, 1: 0.2, 3: 0.25, 4: 0.2
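A minimal sketch of how such a crawl might look in code. The slides do not spell out the formula for the Reference Score, so this sketch ASSUMES it is the number of already crawled neighbors of a node divided by that node's degree; this is consistent with the example values (1/5 = 0.2, 1/4 = 0.25) but remains an assumption. The function names, the tie-breaking rule and the toy graph (two four-node cliques joined by one edge) are hypothetical, and the in-memory `adjacency` dictionary stands in for the requests a real crawler would send.

```python
def reference_score(references, degree):
    """ASSUMED score: crawled neighbors pointing at a node / its degree."""
    return references / degree


def mutual_friend_crawl(adjacency, seed):
    """Crawl from `seed`, always visiting the highest-scoring node next.

    Returns the visit order and the score each node had when it was picked
    (the seed gets a placeholder score of 0.0).
    """
    order, scores = [], []
    crawled = set()
    references = {}  # discovered, uncrawled node -> crawled neighbors so far
    current, current_score = seed, 0.0
    while current is not None:
        order.append(current)
        scores.append(current_score)
        crawled.add(current)
        references.pop(current, None)
        for neighbor in adjacency[current]:
            if neighbor not in crawled:
                references[neighbor] = references.get(neighbor, 0) + 1
        current = None
        # Ties are broken by the smallest node id (an arbitrary choice).
        for node in sorted(references):
            score = reference_score(references[node], len(adjacency[node]))
            if current is None or score > current_score:
                current, current_score = node, score
    return order, scores


# Hypothetical toy graph: cliques {0,1,2,3} and {4,5,6,7}, bridged by 3-4.
two_cliques = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
order, scores = mutual_friend_crawl(two_cliques, seed=0)
print(order)
print([round(s, 2) for s in scores])
```

In this toy graph the crawl finishes the first clique before crossing the bridge, and the score of the first node picked in the second clique is lower than that of the node picked just before it.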
Crawling Online Social Networks via Mutual Friend Crawling - Performance
[Figure: BFS (blue), DFS (green), MFC (red) on the American Football network (Newman et al.)]
Community Detection in OSNs via Mutual Friend Crawling
How does the Reference Score behave while MFC is traversing the graph?
As MFC stays within communities, the Reference Score keeps increasing, indicating that the community is tightly connected.
As soon as there is a drop in S_R, a new community has been found.
This drop is largest if a pronounced community structure exists; otherwise it is small.
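The drop rule can be sketched as a simple pass over the crawl trace. The slides do not say how large a drop must be to count, so the threshold `min_drop` is a hypothetical parameter, and the trace below is made up for illustration (it corresponds to two four-node cliques joined by a single edge, not to a dataset from the talk).

```python
def split_on_drops(order, scores, min_drop=0.0):
    """Start a new community whenever the score falls by more than min_drop."""
    if not order:
        return []
    communities = [[order[0]]]
    for i in range(1, len(order)):
        drop = scores[i - 1] - scores[i]
        if drop > min_drop:
            communities.append([order[i]])    # drop in S_R: new community
        else:
            communities[-1].append(order[i])  # score did not drop: same one
    return communities


# Made-up trace: nodes in crawl order and the score each had when picked.
trace_order = [0, 1, 2, 3, 4, 5, 6, 7]
trace_scores = [0.0, 0.33, 0.67, 0.75, 0.25, 0.33, 0.67, 1.0]

communities = split_on_drops(trace_order, trace_scores)
print(communities)
```

A larger `min_drop` would ignore small dips, which matters in graphs with weak community structure where, as noted above, the drops are small.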
Community Detection in Online Social Networks via Mutual Friend Crawling
Problem of misclassification: when starting with a hub (node 11), nodes 10 and 21 are classified as being in the same community as node 11 (the first community).
Solution: after finishing a community, check whether each node assigned to it really belongs there.
Conclusion & Future Work
We proposed an algorithm to crawl online social networks community-wise in order
  to minimize sampling bias in communities,
  to be able to analyze data while still crawling the network.
The algorithm detects communities (even for directed and weighted graphs).
Future work:
  overlapping communities
  a formalism to understand the drop in the Reference Score, in order to capture how structured a graph is (compared to modularity)
Thank you for your attention
Questions?
Delft University of Technology
Faculty of Electr. Engineering
Dept. of Telecommunication
Mekelweg 4, 2628 CD Delft, The Netherlands
Room: EWI 19.240
Crawling Online Social Networks via Mutual Friend Crawling - Performance
In order to measure the performance, we looked for ground-truth datasets.
As it is very hard to find real-world datasets where the community partition is known, we came up with a Cluster Graph Generator:
1. node generation and slot assignment
2. assigning nodes to clusters
3. creating the links
4. forcing the generation of a giant connected component (GCC)
The generator can produce arbitrary (predefined) community size distributions.
Multiple community detection algorithms were tested on the ground truth.
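The slides give only the step names of the generator, so the following is a simplified stand-in rather than the generator used in the talk. It ASSUMES a planted-partition style model: links inside a cluster appear with probability `p_in`, links between clusters with probability `p_out`, and any leftover components are attached to the largest one. Slot assignment is not modeled, and all names and default values are hypothetical.

```python
import random


def connected_components(adjacency):
    """Return the connected components of an undirected graph as sets."""
    unseen = set(adjacency)
    components = []
    while unseen:
        start = unseen.pop()
        component, stack = {start}, [start]
        while stack:
            for neighbor in adjacency[stack.pop()]:
                if neighbor in unseen:
                    unseen.remove(neighbor)
                    component.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def generate_cluster_graph(community_sizes, p_in=0.8, p_out=0.02, seed=0):
    """Build a graph with known communities; returns (adjacency, membership)."""
    rng = random.Random(seed)
    # Steps 1-2 (simplified): create the nodes and assign them to clusters.
    membership = {}
    for cluster, size in enumerate(community_sizes):
        for _ in range(size):
            membership[len(membership)] = cluster
    nodes = list(membership)
    # Step 3: create links, dense inside clusters and sparse between them.
    adjacency = {node: set() for node in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            p = p_in if membership[u] == membership[v] else p_out
            if rng.random() < p:
                adjacency[u].add(v)
                adjacency[v].add(u)
    # Step 4: force one connected component by linking every smaller
    # component to the largest one.
    components = sorted(connected_components(adjacency), key=len, reverse=True)
    if components:
        giant = sorted(components[0])
        for component in components[1:]:
            u, v = rng.choice(giant), rng.choice(sorted(component))
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency, membership


adjacency, membership = generate_cluster_graph([5, 5, 10])
print(len(adjacency), "nodes,", len(connected_components(adjacency)), "component")
```

Because `membership` records the planted cluster of every node, the output of a community detection method can be compared directly against this ground truth.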