maccioni@dia.uniroma3.it y d-b hel topic Antonio Maccioni email Big Data Course locatedin re whe 14 May 2015 is-a when affiliated Big Graph Data Management Rome
this talk is about Graph Databases: models, languages, use cases Graph Database Management Systems Graph Processing Systems (intro) Open Problems in Graph Data Management...and research project around us
scenario is working on PGX (Parallel Graph Analytics), has launched Green-Marl and has implemented an RDF layer on top of Oracle NoSQL developing a layer, VERTEXICA, for graph mining on top of the analytical database is going towards graph search capabilities (Tao, Unicorn, Open Graph protocol, etc.) has been working on the Knowledge Graph for a while, has created Pregel and has launched Cayley launched the language interface SQL-GR to run graph analysis on top of its analytical database has open-sourced its graph database FlockDB has working on Trinity, an in-memory graph database has just aquired Aurelius TitanDB
scenario Over 25 percent of enterprises will use graph databases by 2017 - Enterprise DBMS, Q1 2014. Forrester Research graph databases are catching on commercially - Michael Stonebraker (2014 ACM Turing Award)
nosql scenario Simple data models Graph Databases are an odd fish in the NoSQL pond - P.J. Sadalage, M. Fowler - NoSQL Distilled
nosql scenario Simple data models But if we want to represent connections we may opt for a graph database management systems:
a graph 1 4 2 5 3 6 Adjacency lists Adjacency matrix from\to 1 2 3 4 5 6 1 0 1 0 1 1 0 2 0 0 1 0 0 0 3 0 0 0 0 1 0 4 0 0 0 0 1 1 5 0 0 0 0 0 1 6 0 0 0 0 0 0 [1]->2->4->5 [2]->3 [3]->5 [4]->5->6 [5]->6
graph + data = graph database admin belongs belongs follows likes works married admin belongs likes friends friends worked belongs
why graph databases? More natural modeling Manage connections explicitly Run algorithms of network science (e.g., PageRank)
use case 1: semantic web A Web-scale architecture Web of Data for metadata and data management (together) for interoperarbility of data and services Compatible with other Web technologies Based a set of W3C standards (HTTP, IRI/URI,RDF, SPARQL, OWL) Web 3.0 make information understandable by machines
use case 1: semantic web HTTP request of data by URI You can follow links (the edges of the graph)
use case 1: semantic web Semantic Web + Open Data = Linked Open Data
use case 1: semantic web
sparql SELECT?name {?person1 married?person2.?person2 born?city.?city name Honolulu.?person1 name?name1. }?person1?person2 married name Honolulu name name?city?name1 Which are the names of the people married with people born in Honolulu?
sparql?person1 SELECT?name {?person1 married?person2.?person2 born?city.?city name Honolulu.?person1 name?name1. } Barack Obama name married name Honolulu name Honolulu Pattern matching style of querying name?city?name1 p2 married born name?person2 p1 name Michelle Obama born c1 c2 locatedin name Chicago locatedin n1 name USA
sparql?person1 SELECT?name {?person1 married?person2.?person2 born?city.?city name Honolulu.?person1 name?name1. } Barack Obama name married name Honolulu name Honolulu Pattern matching style of querying name?city?name1 p2 married p1 name Michelle Obama born born name?person2 c1 c2 locatedin name Chicago locatedin n1 name USA
triple stores G: s p o p1 married p2 p1 name p2 born Michelle Obama c2 c2 name Honolulu us name USA......... Indexes on: (s,p,o), (s,o,p), (p,s,o), (p,o,s), (o,s,p), (o,p,s)
query processing?person1 name married Honolulu?person2 name name?city?name?person1?person1 name?person2 married?name name?person2?person1?city?person2?city Honolulu?city?city?person2 G(P=name)?person1 G(P=name) G(P=married) G(P=name and O=Honolulu)
use case 2: social networks admin belongs belongs follows likes works married admin belongs likes friends friends worked belongs
use case 2: social networks Find common friends for every single profile visit FRIEND 1 FROM_DAY FRIEND 2 ~ 1.2 billions tuples * average number of friends per person
use case 2: social networks Find common friends for every single profile visit FRIEND 1 FROM_DAY FRIEND 2 ~ 1.2 billions tuples * average number of friends per person
graph database management systems Three main properties: 1. Property Graph (as data model) 2. Index-free Adjacency (as physical level organization) 3. Path-traversal (as query language)
property graph data model A property graph is a directed multigraph g= (N, E) where every node n N and every edge e E is associated with a set of pairs <key, value>, called properties. It's a schema-less data model
index-free adjacency We say that a (graph) database g satisfies the index-free adjacency if the existence of an edge between two nodes n1 and n2 in g can be tested on those nodes and does not require to access an external, global, index. GOAL: make the cost of a basic traversal independent of the size of the database, in case keeping O(1)
index-free adjacency...trying to keep the cost of a basic traversal O(1)
index-free adjacency...trying to keep the cost of a basic traversal O(1)
neo4j physical layer Store files for different parts of the graph Node store Relationship store Property store Each record contains 4 properties The properties of an element may use more records Property's values can be either stored in the property store or stored in a dynamic string store I. Robinson, J. Webber, E. Eifrem Graph Databases, 2013.
titan, infinitegraph, levelgraph Built above the extensible column store Apache Cassandra Built above the Object Oriented Database Objectivity/DB Built on node.js above the key-value store LevelDB but pluggable to different stores LevelGraph
building a neo4j graph db Server Mode 1. Go to http://www.neo4j.org/, download Neo4J Server and unzip it 2. Run the command./bin/neo4j start (use bin\neo4j.bat on Windows) 3. Find a graphical dashboard at http://localhost:7474/ 4. You can also use it with REST API: http://localhost:7474/db/data/
building a neo4j graph db Embedded mode 1. Import in your java project the library neo4j-kernel-*-*-*.jar and its classes: import org.neo4j.graphdb.*; import org.neo4j.graphdb.factory.graphdatabasefactory; 2. Create the database: GraphDatabaseService gdb = new GraphDatabaseFactory(). newembeddeddatabase("/home/..."); 3. Create nodes and edges: Enum implementing RelationshipType Node n1 = gdb.createnode(); Node n2 = gdb.createnode(); Relationship e12 = n1.createrelationshipto(n2, EdgeType.TYPE); 4. Set the properties: n1.setproperty( name, Rome ); n2.setproperty( name, Italy ); e12.setproperty( type, locatedin );
tinkerpop stack http://www.tinkerpop.com/ BLUEPRINTS Blueprints is a property graph model interface with provided implementations. GREMLIN Gremlin is a domain specific language for traversing property graphs FRAMES Frames exposes the elements of a Blueprints graph as Java objects: software is written in terms of domain objects and their relationships to each other. FURNACE Furnace is a property graph algorithms package REXTER PIPES
building a graph db with blueprints https://github.com/tinkerpop/blueprints/wiki 1. Import in your java project the libraries blueprints-core-*.*.*.jar and blueprints-neo4j-graph-*.*.*.jar with their classes: import com.tinkerpop.blueprints.*; import com.tinkerpop.blueprints.impls.neo4j.neo4jgraph; 2. Create the database: Graph gdb = new Neo4jGraph("/home/..."); 3. Create nodes and edges: Vertex n1 = gdb.addvertex(null); Vertex n2 = gdb.addvertex(null); Edge e12 = gdb.addedge(null, n1, n2, locatedin ); 4. Set the properties: n1.setproperty( name, Rome ); n2.setproperty( name, Italy ); e12.setproperty( type, locatedin );
querying a graph database Gremlin: Imperative query language Descendant of languages such as XPATH Cypher: Declarative query language Descendant of languages such as SQL
gremlin: a path traversal QL gremlin> g = new Neo4jGraph("/home/..."); gremlin> g.v.oute.filter{it.edgeid == 'e1'} ==>e[2][1 EDGE >4] ==>e[1][1 EDGE >3] ==>e[0][1 EDGE >2] ==>e[7][6 EDGE >2]
gremlin: a path traversal QL gremlin> g.v.oute.filter{it.edgeid == 'e1'}.inv.oute. filter{it.edgeid == 'e2'}.inv.nodeid ==>F ==>E ==>E
cypher: a pattern matching QL START: starting point in the graph MATCH: the pattern to match, bound to the starting point WHERE: filtering criteria RETURN: what to return
cypher: a pattern matching QL MATCH: the pattern to match, bound to the starting point node1 edge1 >node2 edge2 >node3 node1 [?] >node2 [?] >node3 node1 [*] >node3 Live hands-on a graph database about beers at http://console.neo4j.org/?id=beer
cypher: a pattern matching QL START n = node(*) MATCH n [r1:edge] >x [r2:edge] >m WHERE (r1.edgeid = 'e1') and (r2.edgeid = 'e2') RETURN m.nodeid ==> F ==> E ==> E
other features Secondary Indexes: defined on properties Transactions: graph databases usually support ACID properties. In Neo4J all operations have to be performed in a transaction: try ( Transaction tx = gdb.begintx() ) { tx.success(); } Other programming language wrappers:
graph processing systems Frameworks to compute (distributed) graph analysis on large graphs: have similar motivations of Hadoop, Spark, etc. help programmers to focus on the algorithm rather than on the implementation support different types of graphs provide a variety of algorithms already implemented Pregel/Giraph GraphLab/Dato Pegasus
google pregel/apache giraph User specifies a vertex program Computation runs a sequence of supersteps In each superstep the program is executed over all the vertexes The program can use messages received in a previous superstep from the neighbors and can send messages to them for the next superstep A vertex can deactivate itself and the computation halts when all the vertexes are deactivated.
research problems about graph DBs How to model a graph database? How to migrate data and queries from existing databases? How to scale queries over large graphs?
modeling graph databases Compact: Sparse: Dense: Reduces the number of data accesses Accesses and updates can be inefficient Reduces number of joins Can violate property graph constraints Needs human intervention for a semantic enrichment
modeling graph databases Orienting the ER: ENTITY 1 ENTITY 1 (0:1) RELATIONSHIP (0:1) ENTITY 2 ENTITY 1 ENTITY 1 (0:N) RELATIONSHIP ENTITY 2 ENTITY 1 ENTITY 1 (0:N) RELATIONSHIP : 2 (0:N) ENTITY 2 RELATIONSHIP : 0 RELATIONSHIP RELATIONSHIP : 1 (0:1) ENTITY 2 ENTITY 2 ENTITY 2 R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
modeling graph databases Oriented-ER: R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
modeling graph databases Partitioning: Rule 1: if a node n is disconnected then it forms a group by itself. Rule 2: if a node n has w (n)>1 and w+(n)>0 then n forms a group by itself. Rule 3: if a node n has w (n)<2 and w+(n)<2 then n is added to the group of a node m such that there exists the edge (m, n) in the O-ER diagram. R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
modeling graph databases Partitioning: R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
modeling graph databases Graph Database Template R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
R2G: from relations to graphs SQL select * from T where T.A1 = v1 R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014
R2G: unifiability of data values Joinable tuples t1 R1 and t2 R2: there is a foreign key constraint between R1.A and R2.B and t1[a] = t2[b]. Unifiability of data values t1[a] and t2[b]: (i) t1=t2 and both A and B do not belong to a multi-attribute key; (ii) t1 and t2 are joinable and A belongs to a multi-attribute key; (iii) t1 and t2 are joinable, A and B do not belong to a multi-attribute key and there is no other tuple t3 that is joinable with t2. R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014
R2G: schema graph Full Schema Paths: FR.fuser US.uid US.uname FR.fuser FR.fblog BG.bid BG.bname FR.fuser FR.fblog BG.bid BG.admin US.uid US.uname... R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014
R2G: data migration R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014
R2G: query migration R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014
scalability over real-world graphs > 500 million users > 1.2 billion active users > 500 million users Graph Databases are hard to scale
real-world graphs 10% of the users follow the same five users Graph Databases are hard to scale power-law graphs preferential attachment scale-free graphs
real-world graphs 10% of the users follow the same five users Graph Databases are very hard to scale Replication Partitioning
real-world graphs A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification
real-world graphs follows fol low s fo l lo ws A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification
real-world graphs A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification
any redundancy? A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification
sparsification SPARSIFICATION high-degree node low-degree node compressor node A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification
compression via sparsification 1 C 2 B 3 A 4 Y D E E GR N O I AT C I SIF R SPA 1 2 4 3 BC C ABC B AB A 5 5 6 6 SPA CEAW ARE SPA RSI FIC ATI ON 1 C B ABC 3 4 6 5 2 A A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification
graph pattern matching with Greedy Compressed Graphs with Space-aware Compressed Graphs A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification
open problems How to shard/partition a graph database? How to visualize large graphs? Specialized startups are addressing the problem Standardization of a query language Graph processing with GPUs Medusa-gpu, MapGraph What is the best way to implements graph layer on top of SQL/NoSQL systems? IBM, Oracle, Teradata, HP,...
conclusion Graphs are used in many fields Graphs are more complicated to manage than other types of data and we need different considerations When we need to store a big graph Social Networks, Bioinformatics, Semantic Web, Geo-informatics,... we have many options, each one with both advantages and disadvantages Scaling queries over graph databases is still an infant area of database industry and research
thanks for the attention