Graph Processing and Social Networks
Presented by Shu Jiayu, Yang Ji
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
2015/4/20
Outline
- Background
- Graph database
- Large graph processing
- Social network analysis
- Conclusion
Background
- Graphs are everywhere: the Internet, social networks, biological networks
Background
- Graph processing
  - Online query processing: OLTP workloads for quick, low-latency access to small portions of graph data
  - Offline graph analysis: OLAP workloads allowing batch processing of large portions of a graph
- Graph databases and graph mining systems, e.g. Neo4j, Pregel
Graph Database
- What is a graph database?
  - Graph database model: node, edge, property
  - Storage is optimized for data represented as a graph
  - Storage is optimized for traversal of the graph
  - Flexible data model
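The node/edge/property model can be illustrated with a minimal in-memory sketch. The class and method names (`PropertyGraph`, `add_node`, `add_edge`, `neighbors`) are hypothetical and are not the API of Neo4j or any other graph database; the point is only that each node keeps its own outgoing edges, so a traversal follows references instead of joining tables.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph (hypothetical, not a real database API)."""

    def __init__(self):
        self.nodes = {}      # node id -> Node
        self.out_edges = {}  # node id -> list of outgoing Edge objects

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.out_edges.setdefault(node_id, [])

    def add_edge(self, source, target, label, **properties):
        self.out_edges[source].append(Edge(source, target, label, properties))

    def neighbors(self, node_id, label):
        """Follow the outgoing edges of one node that carry the given label."""
        return [e.target for e in self.out_edges[node_id] if e.label == label]


g = PropertyGraph()
g.add_node(1, name="Alice")
g.add_node(2, name="Bob")
g.add_node(3, name="Carol")
g.add_edge(1, 2, "KNOWS", since=2012)
g.add_edge(2, 3, "KNOWS", since=2014)

# Two-hop traversal: people known by the people Alice knows.
friends_of_friends = [
    g.nodes[fof].properties["name"]
    for friend in g.neighbors(1, "KNOWS")
    for fof in g.neighbors(friend, "KNOWS")
]
print(friends_of_friends)
```

The two-hop query touches only the edges attached to the visited nodes, which is the kind of relationship exploration graph storage is optimized for.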
Graph Database
- Why a graph database?
  - Focuses on relationships between entities
  - Supports a greater level of data complexity
  - Ease of data modeling
- Graph database vs. relational database
  - Relational databases are well suited to find-all-like queries
  - Graph databases are well suited to exploring relationships
Graph Database
- Example: representing a business problem and its associated entities
Graph Database: an example
- Neo4j
  - Property graph model
  - Supports ACID (atomicity, consistency, isolation, durability)
Large-scale Graph
- Large graph processing challenges
  - Large graphs exceed the memory, and even the disk, of a single machine
  - The computational capacity of a single machine is limited
- Solution
  - Distributed parallel processing
Large Graph Processing Systems
- MapReduce-based: Pegasus
  - Computation model is MapReduce
  - A large graph mining library on top of Hadoop/MapReduce
- BSP-based: Pregel
  - Adopts the BSP (Bulk Synchronous Parallel) programming model
  - A large graph processing library on top of BSP
Large Graph Processing System: Pegasus
- MapReduce programming model
  - Map function
    - Input: a key/value pair
    - Output: a set of intermediate key/value pairs
  - Reduce function
    - Input: a set of values for an intermediate key
    - Output: a set of key/value pairs
Large Graph Processing System: Pegasus
- Example: count the number of occurrences of each word
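The word-count example can be sketched in plain Python. This is a single-process illustration, not Hadoop code: `run_mapreduce` is a hypothetical driver that stands in for the framework's shuffle phase by grouping intermediate values by key, and the two input documents are made up.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: one (document id, text) pair in, a set of (word, 1) pairs out."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: all values for one intermediate key in, (word, total) out."""
    yield word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = {}
    for k in sorted(groups):
        for out_key, out_value in reduce_fn(k, groups[k]):
            output[out_key] = out_value
    return output


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network; only the two user-supplied functions carry over unchanged.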
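Large Graph Processing System: Pegasus
- GIM-V (Generalized Iterated Matrix-Vector multiplication)
- $M \times v = v'$, where $v'_i = \sum_{j=1}^{n} m_{i,j} v_j$

\[
\begin{bmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}
=
\begin{bmatrix} m_{1,1} v_1 + m_{1,2} v_2 + \cdots + m_{1,n} v_n \\ \vdots \\ m_{n,1} v_1 + m_{n,2} v_2 + \cdots + m_{n,n} v_n \end{bmatrix}
=
v_1 \begin{bmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{bmatrix} + \cdots + v_n \begin{bmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{bmatrix}
\]

- combine2: multiply $m_{i,j}$ and $v_j$
- combineAll: sum the n multiplication results for node i
- assign: overwrite the previous value of $v_i$ with the new result to make $v'_i$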
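One GIM-V iteration can be sketched as a function parameterized by the three operations. The function name `gim_v`, the sparse `(i, j, m_ij)` entry format, and the 2x2 example matrix are assumptions made for illustration; with the standard choices (multiply, sum, overwrite) the result is an ordinary matrix-vector product.

```python
def gim_v(matrix_entries, v, combine2, combine_all, assign):
    """Run one GIM-V iteration and return the new vector v'.

    matrix_entries: iterable of (i, j, m_ij) for the non-zero entries of M.
    """
    partial = [[] for _ in v]
    for i, j, m_ij in matrix_entries:
        # combine2: combine matrix entry m_ij with vector element v_j
        partial[i].append(combine2(m_ij, v[j]))
    # combineAll merges the partial results of row i; assign produces v'_i
    return [assign(v[i], combine_all(partial[i])) for i in range(len(v))]


# M = [[1, 2], [3, 4]] stored as sparse entries, v = [1, 1]
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [1, 1]

v_new = gim_v(
    M,
    v,
    combine2=lambda m_ij, v_j: m_ij * v_j,  # multiply
    combine_all=sum,                        # sum the products for node i
    assign=lambda old, new: new,            # overwrite the old value
)
print(v_new)  # [3, 7]
```

Keeping the three operations as parameters is what makes the primitive "generalized": the driver loop stays the same while the operations change.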
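Large Graph Processing System: Pegasus
- Application: PageRank (calculates the relative importance of web pages)

\[
\begin{bmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}
=
\begin{bmatrix} m_{1,1} v_1 + m_{1,2} v_2 + \cdots + m_{1,n} v_n \\ \vdots \\ m_{n,1} v_1 + m_{n,2} v_2 + \cdots + m_{n,n} v_n \end{bmatrix}
=
v_1 \begin{bmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{bmatrix} + \cdots + v_n \begin{bmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{bmatrix}
\]

- $M$: a transition matrix, $v$: the rank vector, $v'$: the new rank vector
- Input: an edge file and a vector file
- Stage 1: performs the combine2 operation by combining columns of the matrix with rows of the vector, and outputs key/value pairs
- Stage 2: combines all partial results from Stage 1 and assigns the new vector to the old one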
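The two stages can be sketched in plain Python as follows. Several things here are assumptions rather than content from the slide: the damping factor (0.85), the fixed iteration count, the function names, the four-link example graph, and the requirement that every page has at least one outgoing link. The edge file holds `(i, j, m_ij)` triples of a column-normalized transition matrix and the vector file holds `(id, rank)` pairs.

```python
from collections import defaultdict


def stage1(edge_file, vector_file):
    """combine2: join matrix column j with vector row j, emit (i, m_ij * v_j)."""
    v = dict(vector_file)
    for i, j, m_ij in edge_file:
        yield i, m_ij * v[j]


def stage2(partials, vector_file, damping):
    """combineAll and assign: sum partial results per node, overwrite old rank."""
    n = len(vector_file)
    sums = defaultdict(float)
    for i, x in partials:
        sums[i] += x
    return [(i, (1 - damping) / n + damping * sums[i]) for i, _ in vector_file]


def pagerank(links, n, damping=0.85, iterations=50):
    """links: (source, destination) pairs; every page needs an outgoing link."""
    out_degree = defaultdict(int)
    for src, _ in links:
        out_degree[src] += 1
    # Transition matrix entry m[dst][src] = 1 / outdegree(src)
    edge_file = [(dst, src, 1.0 / out_degree[src]) for src, dst in links]
    vector_file = [(i, 1.0 / n) for i in range(n)]
    for _ in range(iterations):
        partials = stage1(edge_file, vector_file)
        vector_file = stage2(partials, vector_file, damping)
    return dict(vector_file)


# Hypothetical web graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks = pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
print(ranks)
```

Page 2 receives links from both other pages and ends up with the highest rank, and page 1, linked only from page 0, with the lowest.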
Large Graph Processing System: Pregel
- BSP (Bulk Synchronous Parallel) model
Large Graph Processing System: Pregel
- Google's implementation of BSP
- Node -> Vertex
- Message passing
- Combiners
- Aggregators
- Vertex ID
- Vertex Value
Large Graph Processing System: Pregel
- Application: PageRank
  - Each vertex initializes its value in superstep 0
  - Each vertex sends along each outgoing edge its tentative PageRank divided by its number of outgoing edges
  - In each superstep, each vertex sums the values arriving in messages and computes its new tentative PageRank from that sum
  - Terminates when convergence is achieved
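The vertex-centric steps above can be simulated in a single process. This sketch is not Pregel's API: the function name, the dictionary-based inbox and outbox, the damping factor 0.85, and the convergence tolerance are all assumptions made for illustration. Swapping the outbox into the inbox at the end of the loop body plays the role of the BSP barrier between supersteps.

```python
def pregel_pagerank(out_edges, damping=0.85, max_supersteps=100, tol=1e-9):
    """Simulate synchronous supersteps; out_edges maps vertex -> list of targets."""
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}  # superstep 0: initial vertex values
    inbox = {v: [] for v in out_edges}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in out_edges}
        change = 0.0
        for v in out_edges:  # one compute() call per vertex
            if superstep >= 1:
                # Sum incoming messages and update the tentative PageRank
                new_value = (1 - damping) / n + damping * sum(inbox[v])
                change = max(change, abs(new_value - value[v]))
                value[v] = new_value
            # Send rank divided by the number of outgoing edges along each edge
            for target in out_edges[v]:
                outbox[target].append(value[v] / len(out_edges[v]))
        inbox = outbox  # barrier: messages become visible in the next superstep
        if superstep >= 1 and change < tol:
            break  # converged
    return value


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
pregel_ranks = pregel_pagerank({0: [1, 2], 1: [2], 2: [0]})
print(pregel_ranks)
```

Unlike the MapReduce formulation, the graph structure stays in place between iterations and only the messages move.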
Introduction to Social Networks
- A social network is a social structure of people who are related, directly or indirectly, to each other through a common relation or interest
- Social network analysis (SNA) is the study of social networks to understand their structure and behavior
Data Mining for Social Network Analysis
- Community detection
- Link prediction
- Search in social networks
- Trust in social networks
- Characterization of social networks
- Other research topics in social networks
Community Detection
- Discovering communities of users in a social network
- Community: a tightly knit region of the network
  - Has strong internal node-to-node connections
  - Has weaker external connections
- Community detection algorithms favor high connectivity within a community and low connectivity to the rest of the network
Girvan-Newman Algorithm
1. Calculate the edge betweenness of all edges
2. Remove the edge with the highest betweenness
3. Recalculate betweenness
4. Repeat until all edges are removed, or until the modularity function is optimized (depending on the variation)
Girvan-Newman Algorithm
- Edge betweenness
  - Measures the contribution of an edge to all shortest paths
  - Computed from all shortest paths between every pair of vertices
  - If there are N shortest paths between two vertices, each path gets a weight of 1/N
- Edge betweenness example for edge E-A (figure with vertices A, B, C, D, E)
  - D-B: +0.5
  - E-B: +0.5
  - E-A: +1
  - Total = 2
Girvan-Newman Algorithm: Example
- Betweenness(7-8) = 7 × 7 = 49
- Betweenness(1-3) = 1 × 12 = 12
- Betweenness(3-7) = Betweenness(6-7) = Betweenness(8-9) = Betweenness(8-12) = 3 × 11 = 33
Girvan-Newman Algorithm: Example
- After removing edge 7-8:
  - Betweenness(1-3) = 1 × 5 = 5
  - Betweenness(3-7) = Betweenness(6-7) = Betweenness(8-9) = Betweenness(8-12) = 3 × 4 = 12
Girvan-Newman Algorithm: Example
- After removing edges 3-7, 6-7, 8-9, and 8-12:
  - Betweenness of every remaining edge = 1
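The betweenness values in this example can be reproduced with a short sketch. The edge list below is an assumption: the figure is not available, so the 14-vertex graph (four triangles joined through vertices 7 and 8) is reconstructed from the values on the slides. The betweenness routine uses Brandes-style accumulation over breadth-first searches, which is one common way to compute it and not necessarily the method used for the slides; all function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an undirected, unweighted graph (dict of sets)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest vertices, splitting credit among paths
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                credit = sigma[u] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((u, w))] += credit
                delta[u] += credit
    # Each unordered vertex pair was counted once from each endpoint
    return {edge: b / 2.0 for edge, b in bet.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, result = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        result.append(comp)
    return result


def split_once(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {u: set(vs) for u, vs in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Assumed reconstruction of the example graph
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 7), (6, 7), (7, 8),
         (8, 9), (8, 12), (9, 10), (9, 11), (10, 11), (12, 13), (12, 14), (13, 14)]
adj = {n: set() for n in range(1, 15)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

bet = edge_betweenness(adj)
print(bet[frozenset((7, 8))], bet[frozenset((1, 3))], bet[frozenset((3, 7))])
communities = split_once(adj)
print(sorted(sorted(c) for c in communities))
```

On this graph the first removed edge is 7-8, which splits the vertices into {1, ..., 7} and {8, ..., 14}; applying `split_once` again to each part continues the hierarchy.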
Link Prediction
- Predict likely interactions that are not explicitly observed, based on observed links
- Primarily used to predict new friendships and to study friendship structures and co-authorship networks
- Given a snapshot of a social network, it is possible to infer new interactions between members who have never interacted before
Link Prediction Methods
- Given the input graph G, a connection weight score(x, y) is assigned to each pair of nodes <x, y>
- A ranked list is produced in decreasing order of score(x, y)
- The score can be viewed as a measure of proximity or similarity between nodes x and y
Link Prediction Methods
- Node neighborhood based methods
  - Common neighbors
  - Jaccard's coefficient
  - Adamic-Adar
- All paths based methods
  - PageRank
  - SimRank
- Higher level approaches
  - Clustering
Node Neighborhood Based Methods
- Common neighbors: $\mathrm{score}(u, v) = |N(u) \cap N(v)|$
- Jaccard's coefficient: $\mathrm{score}(u, v) = \dfrac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}$
- Adamic-Adar: $\mathrm{score}(u, v) = \sum_{z \in N(u) \cap N(v)} \dfrac{1}{\log |N(z)|}$
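The three neighborhood scores translate directly into set operations. In this sketch the graph is an undirected adjacency dictionary of sets; the five-vertex example graph and the function names are made up for illustration, and the natural logarithm is assumed for Adamic-Adar.

```python
import math
from itertools import combinations


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    # A common neighbor z of two distinct vertices has degree >= 2, so log > 0
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


# Hypothetical undirected graph with edges a-c, a-d, b-c, b-d, b-e, d-e
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

print(common_neighbors(adj, "a", "b"))  # 2
print(jaccard(adj, "a", "b"))
print(adamic_adar(adj, "a", "b"))

# Ranked list of currently unlinked pairs, in decreasing order of score
candidates = [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]
ranked = sorted(candidates, key=lambda pair: adamic_adar(adj, *pair), reverse=True)
print(ranked)
```

Adamic-Adar weights each common neighbor by the inverse log of its degree, so a shared neighbor with few connections counts for more than a shared hub.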
All Paths Based Method: PageRank
- PageRank is an object-ranking algorithm
- It assumes that a user starts a random walk by opening a page and then clicking on a link on that page
All Paths Based Method: SimRank
- SimRank is a link analysis algorithm that works on a graph G to measure the similarity between two vertices u and v
- For nodes u and v, the similarity is denoted by $s(u, v) \in [0, 1]$
- If $u = v$, then $s(u, v) = 1$
- The definition is recursive: the similarity of u and v is computed from the similarities of their neighbors

\[
s(u, v) = \frac{C}{|N(u)|\,|N(v)|} \sum_{a \in N(u)} \sum_{b \in N(v)} s(a, b)
\]
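The recursive definition can be evaluated by fixed-point iteration, starting from s(u, u) = 1 and s(u, v) = 0 otherwise. The decay constant C = 0.8, the iteration count, the convention that a pair involving a vertex without neighbors scores 0, and the three-vertex example are assumptions made for this sketch.

```python
def simrank(adj, C=0.8, iterations=10):
    """Iterate the SimRank recurrence on an adjacency dict of neighbor sets."""
    nodes = list(adj)
    s = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not adj[u] or not adj[v]:
                    new[u][v] = 0.0  # assumed convention for isolated vertices
                else:
                    total = sum(s[a][b] for a in adj[u] for b in adj[v])
                    new[u][v] = C * total / (len(adj[u]) * len(adj[v]))
        s = new
    return s


# Hypothetical graph: x is linked to both a and b
star = {"x": {"a", "b"}, "a": {"x"}, "b": {"x"}}
sim = simrank(star)
print(sim["a"]["b"])  # 0.8
print(sim["x"]["a"])  # 0.0
```

Vertices a and b share their only neighbor, so their similarity equals C, while x and a never share a similar pair of neighbors and stay at 0.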
Conclusion
- Graph processing
  - Online query processing
    - Graph database: Neo4j
  - Offline graph analysis
    - Large graph mining systems: Pegasus, Pregel
    - Social network analysis: community detection, link prediction
References
- Angles R, Gutierrez C. Survey of graph database models[J]. ACM Computing Surveys (CSUR), 2008, 40(1): 1.
- Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
- Kang U, Tsourakakis C E, Faloutsos C. Pegasus: A peta-scale graph mining system implementation and observations[C]//Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on. IEEE, 2009: 229-238.
- Kang U, Tsourakakis C E, Faloutsos C. Pegasus: mining peta-scale graphs[J]. Knowledge and Information Systems, 2011, 27(2): 303-325.
- Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010: 135-146.
- Shao B, Wang H, Xiao Y. Managing and mining large graphs: systems and implementations[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012: 589-592.
References (continued)
- Newman M E J. Modularity and community structure in networks[J]. Proceedings of the National Academy of Sciences, 2006, 103(23): 8577-8582.
- Leskovec J, Lang K J, Mahoney M. Empirical comparison of algorithms for network community detection[C]//Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
- Girvan M, Newman M E J. Community structure in social and biological networks[J]. Proceedings of the National Academy of Sciences, 2002, 99(12): 7821-7826.
- Liben-Nowell D, Kleinberg J. The link prediction problem for social networks[J]. Journal of the American Society for Information Science and Technology, 2007, 58(7): 1019-1031.
- Jeh G, Widom J. SimRank: a measure of structural-context similarity[C]//Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002.
Thank You