Graph Algorithms and Graph Databases. Dr. Daisy Zhe Wang CISE Department University of Florida August 27th 2014

Similar documents

Part 1: Link Analysis & Page Rank

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Search engines: ranking algorithms

Practical Graph Mining with R. 5. Link Analysis

Estimating PageRank Values of Wikipedia Articles using MapReduce

The PageRank Citation Ranking: Bring Order to the Web

1 o Semestre 2007/2008

Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time

Link Analysis. Chapter PageRank

Part 2: Community Detection

G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Social Network Mining

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

Social Media Mining. Network Measures

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Lecture Data Warehouse Systems

Big Data Analytics. Lucas Rego Drumond

The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.)

PSG College of Technology, Coimbatore Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Graph Processing and Social Networks

SPARQL: Un Lenguaje de Consulta para la Web

Teaching Scheme Credits Assigned Course Code Course Hrs./Week. BEITC802 Big Data Analytics. Theory Marks

Network Graph Databases, RDF, SPARQL, and SNA

HIGH PERFORMANCE BIG DATA ANALYTICS

DATA ANALYSIS II. Matrix Algorithms

Social Network Analysis

Social Media Mining. Graph Essentials

Analysis of Web Archives. Vinay Goel Senior Data Engineer

RDF y SPARQL: Dos componentes básicos para la Web de datos

Databases 2 (VU) ( )

Big Systems, Big Data

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

Enhancing the Ranking of a Web Page in the Ocean of Data

Systems and Algorithms for Big Data Analytics

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Graph Database Performance: An Oracle Perspective

From relational databases to linked data:r for the semantic web. Jose Quesada, Max Planck Institute, Berlin

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Extending the Linked Data API with RDFa

Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks

Semantic Web Standard in Cloud Computing

Mining Social Network Graphs

A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud.

GRAPH DATABASE SYSTEMS. h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

How To Improve Performance In A Database

Detection and Elimination of Duplicate Data from Semantic Web Queries

Complex Networks Analysis: Clustering Methods

Graph Database Proof of Concept Report

Scaling Up HBase, Hive, Pegasus

Mining Social-Network Graphs

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering

A linear algebraic method for pricing temporary life annuities

LINKED DATA EXPERIENCE AT MACMILLAN Building discovery services for scientific and scholarly content on top of a semantic data model

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Removing Web Spam Links from Search Engine Results

Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher

Corso di Biblioteche Digitali

Big Data Systems CS 5965/6965 FALL 2015

Perron vector Optimization applied to search engines

HadoopRDF : A Scalable RDF Data Analysis System

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

Big Data Analytics. Rasoul Karimi

Efficient Identification of Starters and Followers in Social Media

Machine Learning over Big Data

Semantic Stored Procedures Programming Environment and performance analysis

SGL: Stata graph library for network analysis

Bigdata : Enabling the Semantic Web at Web Scale

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

Big Data Analytics. Lucas Rego Drumond

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

Social Networks and Social Media

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

Microblogging Queries on Graph Databases: An Introspection

RDF Support in Oracle Oracle USA Inc.

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

Cloud Computing at Google. Architecture

A Comparison of Current Graph Database Models

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Transcription:

Graph Algorithms and Graph Databases Dr. Daisy Zhe Wang CISE Department University of Florida August 27th 2014 1

Google Knowledge Graph -- Entities and Relationships 2

Graph Data! Facebook Social Network graph 3

Graph Data! (cont.) Citation Networks and Maps of Science, 2012 4

Web as a Graph Web as a directed graph: Nodes: Webpages Edges: Hyperlinks 5

The BowTie of the Web 6

LARGE GRAPH DATABASES AND QUERIES

Classes of Large Graphs Random graphs Node degree is constrained Less common Scale-free graphs Distribution of node degree follows power law Most large graphs are scale-free Small world phenomena & hubs Harder to partition 8

Classes of Large Graphs 9

Organic Growth -> Scale Free 10

Let s Go Hyper! Hyper-edge A traditional edge is binary A hyper edge relates n nodes Child-of edge versus father, mother, child hyper-edge Hyper-node A traditional node represents one entity Hyper node represents a set of nodes Person node versus family hyper-node 11

Social Networks Scale LinkedIn 70 million users Facebook 500 million users 65 billion photos Queries Alice s friends Photos with friends Rich graph Types, attributes Photo7 Photo3 Bob Chris Ed David France Alice Photo2 George Hillary Photo1 Photo8 Photo5 Photo6 Photo4 12

Social Networks: Data Model Node Bob Alice ID, type, attributes Photo7 Photo1 Edge Chris David Photo2 Photo8 Connects two nodes Direction, type, attributes Photo3 Ed France George Hillary Photo5 Photo6 Photo4 13

Managing Graph Data Here we focus on online access Rather than offline access Queries Network analytics and graph mining Read Updates Data update: change node payload Structural update: modify nodes and edges 14

Example: Linked open Data[LoD] Semantic Web (Tim Berners-Lee) Scale Hundreds of data sets 30B+ tuples Queries SPARQL Domains http://www4.wiwiss.fu-berlin.de/lodcloud/state/ 15

LoD Application Example ozone level visualization 2 data sets clean air status [data.gov] Castnet site information [epa.gov] 2 SPARQL queries data.gov epa. gov 16

Web of Data: Data Model (1) Resource Description Framework (RDF) Uniform resource identifier (URI) Triples! 1:subject, 2:predicate, 3:object ex.: philippe, made, idmesh_paper: (URIs) 1: http://data.semanticweb.org/person/philippe-cudre-mauroux 2: http://xmlns.com/foaf/0.1/made 3: http://data.semanticweb.org/conference/www/2009/paper/60 17

Web of Data: Data Model (2) Naturally forms (distributed) graphs Nodes URIs [subjects] URIs / literals [objects] Edges URIs [predicates] Directed Philippe made Idmesh paper 18

RDF Schemas (RDFS) Classes, inheritance Class, Property, SubClass, SubProperty Constraints on structure (predicate) Constraints on subjects (Domain) Constraints on objects (Range) Collections List, Bag 19

RDFS Example 20

RDF Storage (1) RDFa Embedding RDF information (e.g., metadata, markup of attributes) in HTML pages Supported by Google, Yahoo, etc Example: <body> <div about="http://dbpedia.org/resource/massachusetts">the Massachusetts governor is <span rel="db:governor"> <span about="http://dbpedia.org/resource/deval_patrick">deval Patrick </span>, </span> the nickname is "<span property="db:nickname">bay State</span>", and the capital <span rel="db:capital"> <span about="http://dbpedia.org/resource/boston"> has the nickname "<span property="db:nickname">beantown</span>". </span> </span> </div> </body> 21

RDF Storage (2) Various internal formats for DBMSs Giant triple table (triple stores) subject predicate object Property tables subject property1 property2 property3 Sub-graphs 22

SPARQL (1/2) Declarative query language over RDF data SPJ combinations of triple patterns E.g., Retrieve all students who live in Seattle and take a graduate course Select?s Where {?s is_a Student?s lives_in Seattle?s takes?c?c is_a GraduateCourse } 23

SPARQL Query Execution Typically start from bound variables and performs self-joins on giant triple table Select?s Where {?s is_a Student?s lives_in Seattle?s takes?c?c is_a GraduateCourse } π s σ p= is_a o= Student π s σ p= lives_in o= Seattle π s (σ p= takes o s σ p= is_a o= GraduateCourse ) 24

SPARQL (2/2) Beyond conjunctions of triple patterns Named graphs Disjunctions UNION OPTIONAL (semi-structured data model) Predicate filters FILTER (?price < 30) Duplicate handling (bag semantics) DISTINCT, REDUCED Wildcards Negation as failure WHERE {?x foaf:givenname?name. OPTIONAL {?x dc:date?date }. FILTER (!bound(?date)) } 25

Inference Queries Additional data can be inferred using various sets of logical rules PoliticianOf(Person,City) LivesIn(Person,City) RetriedFrom(Prof,School) WorkedFor(Prof,School) Specify which ones to use by entailment regimes [Glimm11] RDF Schema has 14 entailment rules E.g., (p,rdfs:domain,x) && (u, p, y) => (u,rdf:type,x) 26

Graph System Interfaces Relational: SQL Triple store: SPARQL Graph processing system/databases: API 27

Graph Database and Distributed Graph Processing Systems RDF-3X, JENA, RDF stores Oracle Triples Store and Semantic Web Technologies Neo4j mostly widely used graph dbms Google Pregel Microsoft Trinity GraphLab, GraphX 28

RDF-3X [Neumann08] Max Planck Institut für Informatik Thomas Neumann & Gerhard Weikum Open-Source Triple-table storage Indexing: Clustered B+-trees on all six SPO permutations Compression: Leaf pages store deltas for each value Query Optimizations: order-preserving mergejoins, join orderings 29

GraphLab Open-source research project from CMU http://graphlab.org/projects/index.html Support parallel graph operations with consistency guarantees via locking Do not support transactions and ad-hoc query processing (different from graph databases such Neo4j) 30

Graph/Network datasets and Example Projects/Applications Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/ Benchmarking Graph Databases: http://istcbigdata.org/index.php/benchmarking-graphdatabases/ PageRank on Semantic Networks with Application to Word Sense Disambiguation: http://logic.cse.unt.edu/tarau/teaching/softe ng/paperstoread/pagerankwsd_coling04.pdf 31

[REVIEW] PAGE RANK: LINK ANALYSIS OVER LARGE GRAPHS Adapted Slides from Jeff Ullman, Anand Rajaraman and Jure Leskovec from Stanford

How to organize the Web? First try: Human curated Web directories (e.g., Yahoo) Second try: Web Search Information Retrieval using inverted index Good for finding relevant docs in a small and trusted set (e.g., Newspaper articles, Patents) But: Web is huge, full of untrusted documents, random things, web spam, etc. E.g., Word Spam 33

2 Challenges of Web Search 1. Web contains many sources of information. Who to trust? Observation: Trustworthy pages point to each other! (in&out-links) 2. What is the best answer to query newspaper? No single right answer Observation: Pages that actually know about newspapers might all be pointing to many newspapers (outlink) Observation: a good newspaper is pointed to from many sources (inlink) 34

Solution: Ranking nodes on the Graph Based on Link Structures! All web pages are not equally importance can be captured by link structures There is large diversity in the web-graph node connectivity. Let s rank the pages by the link structure! Link Spam also possible but harder 35

Link Analysis Link Analysis algorithms: for computing importance of nodes in a graph Page Rank Topic-Specific (personalized) Page Rank Mining for Communities Web Spam Detection Algorithms

How to Rank Web Pages Based on Link Analysis Web pages are not equally important www.joe-schmoe.com vs. www.ufl.edu Page is more important if it has more inlinks Inlinks as votes www.ufl.edu has 23,400 inlinks www.joe-schmoe.com has 1 inlink Which inlinks are more important? Recursive question! Links from important pages count more

Example PageRank scores 38

Simple recursive formulation Each link s vote is proportional to the importance of its source page If page j with importance r j has n outlinks, each link gets r j /n votes Page j s own importance is the sum of the votes on its inlinks r j =? = r i /3 + r k /4

Simple flow model y Yahoo y/2 A vote from an important page worth more A page is important if it is pointed to by other important pages a/2 Amazon a y/2 m a/2 M soft m Define a rank r j for page j r j = r /d i j i i d i is out-degree of node i flow equations: y = y /2 + a /2 a = y /2 + m m = a /2

Solving the flow equations 3 equations, 3 unknowns, no constants No unique solution All solutions equivalent modulo scale factor flow equations: y = y /2 + a /2 a = y /2 + m m = a /2 Additional constraint forces uniqueness y+a+m = 1 y = 2/5, a = 2/5, m = 1/5 Gaussian elimination method works for small examples, but we need a better method for large graphs

*Matrix formulation Stochastic adjacency matrix M Matrix M has one row and one column for each web page Let page i has di outlinks If i j, then M ij =1/di, Else M ij =0 M is a column stochastic matrix, where columns sum to 1 Rank vector r : vector with one entry per web page r i is the importance score of page i i r i = 1 The flow equations can be written as r = Mr

Matrix Formulation Example Yahoo y a m y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0 Amazon M soft y = y /2 + a /2 a = y /2 + m m = a /2 r = Mr y 1/2 1/2 0 y a = 1/2 0 1 a m 0 1/2 0 m

A More General Example: Flow Equations = Matrix Formulation Remember the flow equation: r j = i j r i /d i Flow equation in the matrix form: r = Mr Suppose page i links to 3 pages, including j

Rank Vector r = Eigenvector of M The flow equations can be written r = Mr The rank vector r is an eigenvector of the stochastic web matrix M x is an eigenvector with the corresponding eigenvalue if: A x = x In fact, its first or principal/dominant eigenvector, with corresponding eigenvalue 1 We can now efficiently solve for r! The method is called Power iteration

Power Iteration method Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks Power iteration: a simple iterative scheme Suppose there are N web pages Initialize: r 0 = [1/N,.,1/N] T Iterate: r k+1 = Mr k Stop when r k+1 - r k 1 < x 1 = 1 i N x i is the L1 norm Can use any other vector norm e.g., Euclidean norm r (t+1) j = r(t) i /d i j i d i is out-degree of node i

Power Iteration Example Yahoo y a m y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0 Amazon M soft y = y /2 + a /2 a = y /2 + m m = a /2 r = Mr y 1/2 1/2 0 y a = 1/2 0 1 a m 0 1/2 0 m

Power Iteration Example (cont.) Amazon Yahoo M soft y a m y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0 Power Iteration: Set r j =1/N 1: r j = i j r i /d i (i.e., r' = Mr) 2: r=r Goto 1 y a = m 1/3 1/3 1/3 1/3 1/2 1/6 5/12 1/3 1/4 3/8 11/24 1/6... 2/5 2/5 1/5

Why Power Iteration works? Power iteration: A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue) r(1)=m r(0) r(2)=m r1=m 2 r0 r(3)=m r2=m 3 r0 Claim: Sequence M r0,m 2 r0, M k r0, approaches the dominant eigenvector of M 49

Random Walk Interpretation Imagine a random web surfer At any time t, surfer is on some page P At time t+1, the surfer follows an outlink from P uniformly at random Ends up on some page Q linked from P Process repeats indefinitely Let p(t) be a vector whose i th component is the probability that the surfer is at page i at time t p(t) is a probability distribution on pages p(t) is a link-based importance rank t->

The stationary distribution Where is the surfer at time t+1? Follows a link uniformly at random p(t+1) = Mp(t) Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t) Then p(t) is called a stationary distribution for the random walk Our rank vector r satisfies r = Mr So r is a stationary distribution p(t) for the random surfer

*Existence and Uniqueness A central result from the theory of random walks (aka Markov processes): For graphs that satisfy certain conditions (i.e., strongly connected, and no dead ends), the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0. Is the Web a strongly connected graph?

2 problems: PageRank: Problems (1) Spider traps (all out-links are within the group) Eventually spider traps absorb all importance (2) Some pages are dead ends (have no out-links) Such pages cause importance to leak out 53

Spider traps A group of pages is a spider trap if there are no links from within the group to outside the group Random surfer gets trapped Spider traps violate the conditions needed for the random walk theorem

Random teleports The Google solution for spider traps At each time step, the random surfer has two options: With probability, follow a link at random With probability 1-, jump to some page uniformly at random Common values for are in the range 0.8 to 0.9 Results: Surfer will teleport out of spider trap within a few time steps

Dead ends Pages with no outlinks are dead ends for the random surfer Nowhere to go on next step All importance becomes zero

Teleport Dealing with dead-ends Follow random teleport links with probability 1.0 from dead-ends Adjust matrix accordingly Prune and propagate Preprocess the graph to eliminate dead-ends Might require multiple passes Compute page rank on reduced graph Approximate values for deadends by propagating values from reduced graph

More Graph Analytics and Mining Algorithms Wiki Book on Graph Algorithms: http://en.wikipedia.org/wiki/book:graph_alg orithms Shortest path finding Topological sorting Minimum spanning tree Strongly connected components Cliques 58

Summary Graph Data and Graph Algorithms Page Rank Algorithm Shortest path, Strongly connected components, etc. Graph Database and Queries (e.g., SPARQL) Project Ideas find relevant dataset and papers problem statement and ideas for techniques