Graph Processing and Social Networks




Presented by Shu Jiayu, Yang Ji
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
2015/4/20

Outline
- Background
- Graph database
- Large graph processing
- Social network analysis
- Conclusion

Background
Graphs are everywhere
- Internet
- Social network
- Biological network

Background
Graph processing
- Online query processing: OLTP workloads for quick, low-latency access to small portions of graph data
- Offline graph analysis: OLAP workloads allowing batch processing of large portions of a graph
- Graph database and graph mining system, e.g. Neo4j, Pregel

Graph Database
What is a graph database?
- Graph database model: node, edge, property
- Storage is optimized for data represented as a graph
- Storage is optimized for the traversal of the graph
- Flexible data model
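The node/edge/property model above can be sketched in a few lines of Python. This is only an illustrative in-memory structure, not how Neo4j or any other graph database stores data; the class names `Node`, `Edge` and `PropertyGraph`, and the sample people and relationship types, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: nodes and edges both carry key/value properties."""

    def __init__(self):
        self.nodes = {}       # node_id -> Node
        self.out_edges = {}   # node_id -> list of outgoing Edge objects

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), dict(properties))
        self.out_edges.setdefault(node_id, [])

    def add_edge(self, source, target, rel_type, **properties):
        self.out_edges[source].append(Edge(source, target, rel_type, dict(properties)))

    def neighbors(self, node_id, rel_type=None):
        # Traversal follows the adjacency list directly; no join is needed.
        return [e.target for e in self.out_edges[node_id]
                if rel_type is None or e.rel_type == rel_type]


g = PropertyGraph()
g.add_node(1, labels=["Person"], name="Alice")
g.add_node(2, labels=["Person"], name="Bob")
g.add_node(3, labels=["Company"], name="Acme")
g.add_edge(1, 2, "KNOWS", since=2014)
g.add_edge(1, 3, "WORKS_AT")
friends = g.neighbors(1, "KNOWS")
```

Keeping outgoing edges next to each node is what makes relationship exploration cheap: a traversal only touches the edges of the nodes it visits.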

Graph Database
Why a graph database?
- Focuses on relationships between entities
- Supports a greater level of data complexity
- Ease of data modeling
Graph database vs. relational database
- Relational databases are well suited to find-all-like queries
- Graph databases are suited to exploring relationships

Graph Database
Example: representing a business problem and its associated entities

Graph Database: an example
Neo4j
- Property graph model
- Supports ACID (atomicity, consistency, isolation, durability)

Large-scale Graph
Large graph processing challenges
- Large graphs exceed the memory and even the disks of a single machine
- The computational ability of a single machine is limited
Solution
- Distributed parallel processing

Large Graph Processing Systems
MapReduce-based: Pegasus
- Computation model is MapReduce
- A large graph mining library on top of Hadoop/MapReduce
BSP-based: Pregel
- Adopts the BSP (Bulk Synchronous Parallel) programming model
- A large graph processing library on top of BSP

Large Graph Processing System: Pegasus
MapReduce programming model
Map function
- input: a key/value pair
- output: a set of intermediate key/value pairs
Reduce function
- input: a set of values for an intermediate key
- output: a set of key/value pairs

Large Graph Processing System: Pegasus
Example: count the number of occurrences of each word
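The word-count example can be sketched as a single-process simulation of the map and reduce functions. This is not Hadoop code; the driver `run_mapreduce` and the two sample documents are hypothetical and only mimic the shuffle step that groups intermediate values by key.

```python
from collections import defaultdict


def map_fn(_key, text):
    # Input: one key/value pair (document name, document text).
    # Output: a set of intermediate (word, 1) pairs.
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    # Input: all values emitted for one intermediate key.
    # Output: a (word, total) pair.
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    # "Shuffle": group intermediate values by key before reducing.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = {}
    for k in sorted(groups):
        for out_k, out_v in reducer(k, groups[k]):
            output[out_k] = out_v
    return output


docs = [("doc1", "the quick brown fox"),
        ("doc2", "the lazy dog and the fox")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```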

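Large Graph Processing System: Pegasus
GIM-V (Generalized Iterated Matrix-Vector multiplication)

$M \times v = v'$, where $v'_i = \sum_{j=1}^{n} m_{i,j} v_j$

$$
\begin{bmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}
=
\begin{bmatrix} m_{1,1} v_1 + m_{1,2} v_2 + \cdots + m_{1,n} v_n \\ \vdots \\ m_{n,1} v_1 + m_{n,2} v_2 + \cdots + m_{n,n} v_n \end{bmatrix}
=
v_1 \begin{bmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{bmatrix} + \cdots + v_n \begin{bmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{bmatrix}
$$

- combine2: multiply $m_{i,j}$ and $v_j$
- combineAll: sum the $n$ multiplication results for node $i$
- assign: overwrite the previous value of $v_i$ with the new result to make $v'_i$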
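One GIM-V step with its three operations passed in as functions might look like the sketch below. The function name `gim_v`, the edge-list layout `(i, j, m_ij)` and the 2x2 sample matrix are assumptions made for illustration; Pegasus itself runs these operations as MapReduce jobs.

```python
def gim_v(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector step.

    edges: list of (i, j, m_ij) triples, the non-zero entries of M
    v:     dict mapping node index to its current value
    """
    partial = {i: [] for i in v}
    for i, j, m_ij in edges:
        # combine2 pairs a matrix entry with the matching vector entry.
        partial[i].append(combine2(m_ij, v[j]))
    # combine_all merges the partial results of row i;
    # assign decides how the old value is replaced by the new one.
    return {i: assign(v[i], combine_all(partial[i])) for i in v}


# Plain matrix-vector multiplication: M = [[1, 2], [3, 4]], v = [1, 1].
edges = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = {0: 1.0, 1: 1.0}
v_new = gim_v(
    edges, v,
    combine2=lambda m_ij, v_j: m_ij * v_j,
    combine_all=sum,
    assign=lambda old, new: new,
)
```

With multiplication, summation and overwrite the step is an ordinary product; swapping in other functions for the three operations is what makes the primitive "generalized".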

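Large Graph Processing System: Pegasus
Application: PageRank (calculate the relative importance of web pages)

$$
\begin{bmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}
=
\begin{bmatrix} m_{1,1} v_1 + m_{1,2} v_2 + \cdots + m_{1,n} v_n \\ \vdots \\ m_{n,1} v_1 + m_{n,2} v_2 + \cdots + m_{n,n} v_n \end{bmatrix}
=
v_1 \begin{bmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{bmatrix} + \cdots + v_n \begin{bmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{bmatrix}
$$

- $M$: a transition matrix, $v$: the rank vector, $v'$: the new rank vector
- Input: an edge file and a vector file
- Stage 1: performs the combine2 operation by combining columns of the matrix with rows of the vector, and outputs key/value pairs
- Stage 2: combines all partial results from Stage 1 and assigns the new vector to the old one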
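The two stages can be imitated in plain Python as below. This is a sketch under stated assumptions, not Pegasus code: the damping factor 0.85, the uniform teleport term `(1 - damping) / n`, the fixed iteration count, and the three-page link list are all assumed here, and the graph is assumed to have no pages without outgoing links.

```python
from collections import defaultdict


def stage1(edge_file, vector_file):
    # combine2: join matrix entries (i, j, m_ij) with vector entries (j, v_j)
    # on the column index j and emit partial products keyed by row i.
    by_column = defaultdict(list)
    for i, j, m_ij in edge_file:
        by_column[j].append((i, m_ij))
    for j, v_j in vector_file:
        for i, m_ij in by_column[j]:
            yield i, m_ij * v_j


def stage2(partials, n, damping):
    # combineAll: sum the partial products of each row, then
    # assign: the result becomes the new vector entry.
    sums = defaultdict(float)
    for i, x in partials:
        sums[i] += x
    return [(i, (1 - damping) / n + damping * sums[i]) for i in range(n)]


def pagerank(links, n, damping=0.85, iterations=100):
    out_degree = defaultdict(int)
    for src, _dst in links:
        out_degree[src] += 1
    # Column-normalized transition matrix: m[dst][src] = 1 / out_degree(src).
    edge_file = [(dst, src, 1.0 / out_degree[src]) for src, dst in links]
    vector_file = [(i, 1.0 / n) for i in range(n)]
    for _ in range(iterations):
        vector_file = stage2(stage1(edge_file, vector_file), n, damping)
    return dict(vector_file)


# Hypothetical 3-page web: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1.
links = [(0, 1), (1, 2), (2, 0), (2, 1)]
ranks = pagerank(links, n=3)
```

Page 1 ends up with the highest rank because it receives links from both other pages.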

Large Graph Processing System: Pregel
BSP (Bulk Synchronous Parallel) model

Large Graph Processing System: Pregel
- Google's implementation of BSP
- Node -> Vertex
- Message passing
- Combiners
- Aggregators
- Vertex ID, vertex value

Large Graph Processing System: Pregel
Application: PageRank
- The value of each vertex is initialized in superstep 0
- Each vertex sends along each outgoing edge its tentative PageRank divided by its number of outgoing edges
- In each superstep, each vertex sums the values arriving in messages and calculates its tentative PageRank from that sum
- Terminates when convergence is achieved
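The superstep loop can be simulated on one machine as follows. This is a sketch, not the Pregel API: the function name `pregel_pagerank`, the damping factor 0.85, the convergence tolerance, the superstep cap, and the three-vertex graph are assumptions made for illustration.

```python
def pregel_pagerank(out_edges, damping=0.85, max_supersteps=50, tol=1e-10):
    """out_edges: dict mapping each vertex to the list of vertices it links to."""
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}   # initialization in superstep 0
    inbox = {v: [] for v in out_edges}        # messages delivered to each vertex
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in out_edges}
        delta = 0.0
        for v, targets in out_edges.items():  # per-vertex computation
            if superstep >= 1:
                # Sum incoming messages and update the tentative PageRank.
                new_value = (1 - damping) / n + damping * sum(inbox[v])
                delta = max(delta, abs(new_value - value[v]))
                value[v] = new_value
            # Send the tentative PageRank, split over the outgoing edges.
            for t in targets:
                outbox[t].append(value[v] / len(targets))
        inbox = outbox                        # barrier between supersteps
        if superstep >= 1 and delta < tol:
            break                             # converged
    return value


# Hypothetical graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
pregel_ranks = pregel_pagerank({0: [1], 1: [2], 2: [0, 1]})
```

Messages written during one superstep are only read in the next one, which is the synchronization barrier of the BSP model.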

Introduction to Social Networks
- A social network is a social structure of people, related (directly or indirectly) to each other through a common relation or interest
- Social network analysis (SNA) is the study of social networks to understand their structure and behavior

Data Mining for Social Network Analysis
- Community Detection
- Link Prediction
- Search in Social Networks
- Trust in Social Networks
- Characterization of Social Networks
- Other Research Topics in Social Networks

Community Detection
- Discovering communities of users in a social network
- Community: a tightly knit region of the network
  - Has strong internal node-node connections
  - Has weaker external connections
- Community detection algorithms stress high internal connectivity and low external connectivity for a given community

Girvan-Newman Algorithm
1. Calculate edge betweenness for all edges
2. Remove the edge with the highest betweenness
3. Recalculate betweenness
4. Repeat until all edges are removed, or until the modularity function is optimized (depending on the variation)

Girvan-Newman Algorithm
Edge betweenness
- Measures the contribution of an edge to all shortest paths
- Calculate all shortest paths between every pair of vertices
- If there are N shortest paths between two vertices, each path gets a weight equal to 1/N
Edge betweenness example, edge E-A
- D-B: +0.5
- E-B: +0.5
- E-A: +1
- Total = 2
[Figure: example graph on nodes A, B, C, D, E]
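Edge betweenness and one Girvan-Newman split can be sketched for small unweighted, undirected graphs as below. The betweenness routine uses breadth-first search with shortest-path counting (a Brandes-style accumulation, named here as a swapped-in technique since the slides do not say how betweenness is computed). The helper names and the two-triangle sample graph are hypothetical, and the modularity-based stopping rule is not implemented.

```python
from collections import deque, defaultdict


def edge_betweenness(adj):
    """adj: dict node -> set of neighbours. Returns {frozenset({u, v}): score}."""
    score = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)   # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)    # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Walk back from the farthest nodes; each of N equal-length paths
        # carries weight 1/N, expressed through the sigma ratio.
        delta = defaultdict(float)
        for w in reversed(order):
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += c
                delta[u] += c
    # Every unordered pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {u: set(vs) for u, vs in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        eb = edge_betweenness(adj)
        if not eb:
            break
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles {1, 2, 3} and {4, 5, 6} joined by the bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
         4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
eb = edge_betweenness(graph)
communities = girvan_newman_split(graph)
```

In this sample the bridge 3-4 lies on every shortest path between the two triangles, so its score is 3 × 3 = 9 and it is removed first.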

Girvan-Newman Algorithm: Example

Girvan-Newman Algorithm: Example
- Betweenness(7-8) = 7 × 7 = 49
- Betweenness(1-3) = 1 × 12 = 12
- Betweenness(3-7) = Betweenness(6-7) = Betweenness(8-9) = Betweenness(8-12) = 3 × 11 = 33

Girvan-Newman Algorithm: Example
- Betweenness(1-3) = 1 × 5 = 5
- Betweenness(3-7) = Betweenness(6-7) = Betweenness(8-9) = Betweenness(8-12) = 3 × 4 = 12

Girvan-Newman Algorithm: Example
- Betweenness of every edge = 1

Link Prediction
- Predict likely interactions that are not explicitly observed, based on observed links
- Primarily used to predict possible new friendships and to study friendship structures and co-authorship networks
- Given a snapshot of a social network, it is possible to infer new interactions between members who have never interacted before

Link Prediction Methods
- Given the input graph G, a connection weight score(x, y) is assigned to each pair of nodes <x, y>
- A ranked list is produced in decreasing order of score(x, y)
- The score can be viewed as a measure of proximity or similarity between nodes x and y

Link Prediction Methods
- Node neighborhood based methods
  - Common neighbors
  - Jaccard's coefficient
  - Adamic-Adar
- All paths based methodologies
  - PageRank
  - SimRank
- Higher level approaches
  - Clustering

Node Neighborhood Based Methods
- Common neighbors: $\mathrm{score}(u, v) = |N(u) \cap N(v)|$
- Jaccard's coefficient: $\mathrm{score}(u, v) = \dfrac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}$
- Adamic-Adar: $\mathrm{score}(u, v) = \sum_{z \in N(u) \cap N(v)} \dfrac{1}{\log |N(z)|}$
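The three neighborhood scores, plus the ranked candidate list described for link prediction methods, can be sketched as below. The adjacency dictionary and the helper `rank_candidates` are hypothetical; the natural logarithm is assumed for Adamic-Adar.

```python
import math


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    # Rare shared neighbours (low degree) count more than hubs.
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[u] & adj[v] if len(adj[z]) > 1)


def rank_candidates(adj, score):
    # Score every pair that is not yet linked, highest score first.
    nodes = sorted(adj)
    pairs = [(u, v) for i, u in enumerate(nodes)
             for v in nodes[i + 1:] if v not in adj[u]]
    return sorted(pairs, key=lambda p: score(adj, *p), reverse=True)


# Small undirected friendship graph (node -> set of neighbours).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
cn_ad = common_neighbors(adj, "a", "d")
jc_ad = jaccard(adj, "a", "d")
aa_ad = adamic_adar(adj, "a", "d")
top_pair = rank_candidates(adj, common_neighbors)[0]
```

Here the unlinked pair (a, d) shares the two neighbours b and c, so it comes first in the ranked list.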

All Paths Based Method: PageRank
- PageRank is one of the algorithms that aims to perform object ranking
- The assumption PageRank makes is that a user starts a random walk by opening a page and then clicking on a link on that page

All Paths Based Method: SimRank
- SimRank is a link analysis algorithm that works on a graph G to measure the similarity between two vertices u and v in the graph
- For the nodes u and v, the similarity is denoted by $s(u, v) \in [0, 1]$
- If $u = v$, then $s(u, v) = 1$
- Otherwise the definition iterates over the similarity of the neighbors of u and v:

$$
s(u, v) = \frac{C}{|N(u)|\,|N(v)|} \sum_{a \in N(u)} \sum_{b \in N(v)} s(a, b)
$$
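The recursive definition can be evaluated by fixed-point iteration, as in the sketch below. The decay constant C = 0.8, the iteration count, and the three-node graph are assumptions; the neighbor sets are passed in directly, and in-neighbors are used here, which is what the original SimRank paper does.

```python
def simrank(neighbors, C=0.8, iterations=10):
    """neighbors: dict node -> set of (in-)neighbours. Returns {(u, v): s(u, v)}."""
    nodes = list(neighbors)
    # Start from s(u, u) = 1 and s(u, v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif not neighbors[u] or not neighbors[v]:
                    # No neighbours to compare: similarity is 0.
                    new[(u, v)] = 0.0
                else:
                    total = sum(sim[(a, b)]
                                for a in neighbors[u] for b in neighbors[v])
                    new[(u, v)] = C * total / (len(neighbors[u]) * len(neighbors[v]))
        sim = new
    return sim


# Node "u" points to both "a" and "b", so their only in-neighbour is "u".
in_neighbors = {"u": set(), "a": {"u"}, "b": {"u"}}
sim = simrank(in_neighbors)
```

Because a and b are referenced by exactly the same node, their similarity equals C times s(u, u), that is 0.8.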

Conclusion
Graph Processing
- Online query processing
  - Graph database: Neo4j
- Offline graph analysis
  - Large graph mining systems: Pegasus, Pregel
Social Network Analysis
- Community detection
- Link prediction


Thank You