Big Graph Data Management

Similar documents

GRAPH DATABASE SYSTEMS. h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

Graph Databases: Neo4j

How graph databases started the multi-model revolution

Cloud Scale Distributed Data Storage. Jürmo Mehine

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Converting Relational to Graph Databases

A Comparison of Current Graph Database Models

NoSQL and Graph Database

Client Overview. Engagement Situation. Key Requirements

Introduction to NOSQL

TITAN BIG GRAPH DATA WITH CASSANDRA #TITANDB #CASSANDRA12

Integrating Big Data into the Computing Curricula

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Objectivity positions graph database as relational complement to InfiniteGraph 3.0

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores

Analysis of Web Archives. Vinay Goel Senior Data Engineer

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Big Data Analytics. Rasoul Karimi

NoSQL and Hadoop Technologies On Oracle Cloud

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Databases 2 (VU) ( )

InfiniteGraph: The Distributed Graph Database

Challenges for Data Driven Systems

Data Modeling for Big Data

1-Oct 2015, Bilbao, Spain. Towards Semantic Network Models via Graph Databases for SDN Applications

HIGH PERFORMANCE BIG DATA ANALYTICS

How To Improve Performance In A Database

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Taming Big Data Variety with Semantic Graph Databases. Evren Sirin CTO Complexible

HadoopRDF : A Scalable RDF Data Analysis System

Choosing The Right Big Data Tools For The Job A Polyglot Approach

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc

How To Scale Out Of A Nosql Database

NoSQL Databases. Nikos Parlavantzas

Cloud Computing at Google. Architecture

In Memory Accelerator for MongoDB

MySQL és Hadoop mint Big Data platform (SQL + NoSQL = MySQL Cluster?!)

Big Data Technologies. Prof. Dr. Uta Störl Hochschule Darmstadt Fachbereich Informatik Sommersemester 2015

Review of Graph Databases for Big Data Dynamic Entity Scoring

Domain driven design, NoSQL and multi-model databases

The Synergy Between the Object Database, Graph Database, Cloud Computing and NoSQL Paradigms

Graph Processing and Social Networks

Lecture Data Warehouse Systems

Big Graph Processing: Some Background

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Graph Database Performance: An Oracle Perspective

Introduction to Big Data Training

Architectures for massive data management

NoSQL Evaluation. A Use Case Oriented Survey

NOSQL DATABASES IN EEG/ERP

Oracle Big Data SQL Technical Update

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Evaluating partitioning of big graphs

Big Data and Scripting Systems beyond Hadoop

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Semantic Web Success Story

Teradata s Big Data Technology Strategy & Roadmap

Architectures for Big Data Analytics A database perspective

A Brief Study of Open Source Graph Databases

Graph Database Applications and Concepts with Neo4j

StratioDeep. An integration layer between Cassandra and Spark. Álvaro Agea Herradón Antonio Alcocer Falcón

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

INTRODUCTION TO CASSANDRA

Big Data and Analytics: Challenges and Opportunities

Graph Processing with Apache TinkerPop

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Cloud Computing and Advanced Relationship Analytics

Big Data looks Tiny from the Stratosphere

Software Life-Cycle Management

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Infrastructures for big data

Relational Database Basics Review

Rule-Based Engineering Using Declarative Graph Database Queries

E6895 Advanced Big Data Analytics Lecture 4:! Data Store

Scalable Architecture on Amazon AWS Cloud

Big Data and Scripting Systems build on top of Hadoop

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Logistics. Database Management Systems. Chapter 1. Project. Goals for This Course. Any Questions So Far? What This Course Cannot Do.

Big Data: Opportunities & Challenges, Myths & Truths 資料來源 : 台大廖世偉教授課程資料

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

An Approach to Implement Map Reduce with NoSQL Databases

Enterprise Operational SQL on Hadoop Trafodion Overview

A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA

Big Data and Data Science: Behind the Buzz Words

Transcription:

maccioni@dia.uniroma3.it y d-b hel topic Antonio Maccioni email Big Data Course locatedin re whe 14 May 2015 is-a when affiliated Big Graph Data Management Rome

this talk is about Graph Databases: models, languages, use cases Graph Database Management Systems Graph Processing Systems (intro) Open Problems in Graph Data Management...and research project around us

scenario is working on PGX (Parallel Graph Analytics), has launched Green-Marl and has implemented an RDF layer on top of Oracle NoSQL developing a layer, VERTEXICA, for graph mining on top of the analytical database is going towards graph search capabilities (Tao, Unicorn, Open Graph protocol, etc.) has been working on the Knowledge Graph for a while, has created Pregel and has launched Cayley launched the language interface SQL-GR to run graph analysis on top of its analytical database has open-sourced its graph database FlockDB has working on Trinity, an in-memory graph database has just aquired Aurelius TitanDB

scenario Over 25 percent of enterprises will use graph databases by 2017 - Enterprise DBMS, Q1 2014. Forrester Research graph databases are catching on commercially - Michael Stonebraker (2014 ACM Turing Award)

nosql scenario Simple data models Graph Databases are an odd fish in the NoSQL pond - P.J. Sadalage, M. Fowler - NoSQL Distilled

nosql scenario Simple data models But if we want to represent connections we may opt for a graph database management systems:

a graph 1 4 2 5 3 6 Adjacency lists Adjacency matrix from\to 1 2 3 4 5 6 1 0 1 0 1 1 0 2 0 0 1 0 0 0 3 0 0 0 0 1 0 4 0 0 0 0 1 1 5 0 0 0 0 0 1 6 0 0 0 0 0 0 [1]->2->4->5 [2]->3 [3]->5 [4]->5->6 [5]->6

graph + data = graph database admin belongs belongs follows likes works married admin belongs likes friends friends worked belongs

why graph databases? More natural modeling Manage connections explicitly Run algorithms of network science (e.g., PageRank)

use case 1: semantic web A Web-scale architecture Web of Data for metadata and data management (together) for interoperarbility of data and services Compatible with other Web technologies Based a set of W3C standards (HTTP, IRI/URI,RDF, SPARQL, OWL) Web 3.0 make information understandable by machines

use case 1: semantic web HTTP request of data by URI You can follow links (the edges of the graph)

use case 1: semantic web Semantic Web + Open Data = Linked Open Data

use case 1: semantic web

sparql SELECT?name {?person1 married?person2.?person2 born?city.?city name Honolulu.?person1 name?name1. }?person1?person2 married name Honolulu name name?city?name1 Which are the names of the people married with people born in Honolulu?

sparql?person1 SELECT?name {?person1 married?person2.?person2 born?city.?city name Honolulu.?person1 name?name1. } Barack Obama name married name Honolulu name Honolulu Pattern matching style of querying name?city?name1 p2 married born name?person2 p1 name Michelle Obama born c1 c2 locatedin name Chicago locatedin n1 name USA

sparql?person1 SELECT?name {?person1 married?person2.?person2 born?city.?city name Honolulu.?person1 name?name1. } Barack Obama name married name Honolulu name Honolulu Pattern matching style of querying name?city?name1 p2 married p1 name Michelle Obama born born name?person2 c1 c2 locatedin name Chicago locatedin n1 name USA

triple stores G: s p o p1 married p2 p1 name p2 born Michelle Obama c2 c2 name Honolulu us name USA......... Indexes on: (s,p,o), (s,o,p), (p,s,o), (p,o,s), (o,s,p), (o,p,s)

query processing?person1 name married Honolulu?person2 name name?city?name?person1?person1 name?person2 married?name name?person2?person1?city?person2?city Honolulu?city?city?person2 G(P=name)?person1 G(P=name) G(P=married) G(P=name and O=Honolulu)

use case 2: social networks admin belongs belongs follows likes works married admin belongs likes friends friends worked belongs

use case 2: social networks Find common friends for every single profile visit FRIEND 1 FROM_DAY FRIEND 2 ~ 1.2 billions tuples * average number of friends per person

use case 2: social networks Find common friends for every single profile visit FRIEND 1 FROM_DAY FRIEND 2 ~ 1.2 billions tuples * average number of friends per person

graph database management systems Three main properties: 1. Property Graph (as data model) 2. Index-free Adjacency (as physical level organization) 3. Path-traversal (as query language)

property graph data model A property graph is a directed multigraph g= (N, E) where every node n N and every edge e E is associated with a set of pairs <key, value>, called properties. It's a schema-less data model

index-free adjacency We say that a (graph) database g satisfies the index-free adjacency if the existence of an edge between two nodes n1 and n2 in g can be tested on those nodes and does not require to access an external, global, index. GOAL: make the cost of a basic traversal independent of the size of the database, in case keeping O(1)

index-free adjacency...trying to keep the cost of a basic traversal O(1)

index-free adjacency...trying to keep the cost of a basic traversal O(1)

neo4j physical layer Store files for different parts of the graph Node store Relationship store Property store Each record contains 4 properties The properties of an element may use more records Property's values can be either stored in the property store or stored in a dynamic string store I. Robinson, J. Webber, E. Eifrem Graph Databases, 2013.

titan, infinitegraph, levelgraph Built above the extensible column store Apache Cassandra Built above the Object Oriented Database Objectivity/DB Built on node.js above the key-value store LevelDB but pluggable to different stores LevelGraph

building a neo4j graph db Server Mode 1. Go to http://www.neo4j.org/, download Neo4J Server and unzip it 2. Run the command./bin/neo4j start (use bin\neo4j.bat on Windows) 3. Find a graphical dashboard at http://localhost:7474/ 4. You can also use it with REST API: http://localhost:7474/db/data/

building a neo4j graph db Embedded mode 1. Import in your java project the library neo4j-kernel-*-*-*.jar and its classes: import org.neo4j.graphdb.*; import org.neo4j.graphdb.factory.graphdatabasefactory; 2. Create the database: GraphDatabaseService gdb = new GraphDatabaseFactory(). newembeddeddatabase("/home/..."); 3. Create nodes and edges: Enum implementing RelationshipType Node n1 = gdb.createnode(); Node n2 = gdb.createnode(); Relationship e12 = n1.createrelationshipto(n2, EdgeType.TYPE); 4. Set the properties: n1.setproperty( name, Rome ); n2.setproperty( name, Italy ); e12.setproperty( type, locatedin );

tinkerpop stack http://www.tinkerpop.com/ BLUEPRINTS Blueprints is a property graph model interface with provided implementations. GREMLIN Gremlin is a domain specific language for traversing property graphs FRAMES Frames exposes the elements of a Blueprints graph as Java objects: software is written in terms of domain objects and their relationships to each other. FURNACE Furnace is a property graph algorithms package REXTER PIPES

building a graph db with blueprints https://github.com/tinkerpop/blueprints/wiki 1. Import in your java project the libraries blueprints-core-*.*.*.jar and blueprints-neo4j-graph-*.*.*.jar with their classes: import com.tinkerpop.blueprints.*; import com.tinkerpop.blueprints.impls.neo4j.neo4jgraph; 2. Create the database: Graph gdb = new Neo4jGraph("/home/..."); 3. Create nodes and edges: Vertex n1 = gdb.addvertex(null); Vertex n2 = gdb.addvertex(null); Edge e12 = gdb.addedge(null, n1, n2, locatedin ); 4. Set the properties: n1.setproperty( name, Rome ); n2.setproperty( name, Italy ); e12.setproperty( type, locatedin );

querying a graph database Gremlin: Imperative query language Descendant of languages such as XPATH Cypher: Declarative query language Descendant of languages such as SQL

gremlin: a path traversal QL gremlin> g = new Neo4jGraph("/home/..."); gremlin> g.v.oute.filter{it.edgeid == 'e1'} ==>e[2][1 EDGE >4] ==>e[1][1 EDGE >3] ==>e[0][1 EDGE >2] ==>e[7][6 EDGE >2]

gremlin: a path traversal QL gremlin> g.v.oute.filter{it.edgeid == 'e1'}.inv.oute. filter{it.edgeid == 'e2'}.inv.nodeid ==>F ==>E ==>E

cypher: a pattern matching QL START: starting point in the graph MATCH: the pattern to match, bound to the starting point WHERE: filtering criteria RETURN: what to return

cypher: a pattern matching QL MATCH: the pattern to match, bound to the starting point node1 edge1 >node2 edge2 >node3 node1 [?] >node2 [?] >node3 node1 [*] >node3 Live hands-on a graph database about beers at http://console.neo4j.org/?id=beer

cypher: a pattern matching QL START n = node(*) MATCH n [r1:edge] >x [r2:edge] >m WHERE (r1.edgeid = 'e1') and (r2.edgeid = 'e2') RETURN m.nodeid ==> F ==> E ==> E

other features Secondary Indexes: defined on properties Transactions: graph databases usually support ACID properties. In Neo4J all operations have to be performed in a transaction: try ( Transaction tx = gdb.begintx() ) { tx.success(); } Other programming language wrappers:

graph processing systems Frameworks to compute (distributed) graph analysis on large graphs: have similar motivations of Hadoop, Spark, etc. help programmers to focus on the algorithm rather than on the implementation support different types of graphs provide a variety of algorithms already implemented Pregel/Giraph GraphLab/Dato Pegasus

google pregel/apache giraph User specifies a vertex program Computation runs a sequence of supersteps In each superstep the program is executed over all the vertexes The program can use messages received in a previous superstep from the neighbors and can send messages to them for the next superstep A vertex can deactivate itself and the computation halts when all the vertexes are deactivated.

research problems about graph DBs How to model a graph database? How to migrate data and queries from existing databases? How to scale queries over large graphs?

modeling graph databases Compact: Sparse: Dense: Reduces the number of data accesses Accesses and updates can be inefficient Reduces number of joins Can violate property graph constraints Needs human intervention for a semantic enrichment

modeling graph databases Orienting the ER: ENTITY 1 ENTITY 1 (0:1) RELATIONSHIP (0:1) ENTITY 2 ENTITY 1 ENTITY 1 (0:N) RELATIONSHIP ENTITY 2 ENTITY 1 ENTITY 1 (0:N) RELATIONSHIP : 2 (0:N) ENTITY 2 RELATIONSHIP : 0 RELATIONSHIP RELATIONSHIP : 1 (0:1) ENTITY 2 ENTITY 2 ENTITY 2 R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.

modeling graph databases Oriented-ER: R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.

modeling graph databases Partitioning: Rule 1: if a node n is disconnected then it forms a group by itself. Rule 2: if a node n has w (n)>1 and w+(n)>0 then n forms a group by itself. Rule 3: if a node n has w (n)<2 and w+(n)<2 then n is added to the group of a node m such that there exists the edge (m, n) in the O-ER diagram. R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.

modeling graph databases Partitioning: R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.

modeling graph databases Graph Database Template R. De Virgilio, A. Maccioni, R. Torlone Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.

R2G: from relations to graphs SQL select * from T where T.A1 = v1 R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014

R2G: unifiability of data values Joinable tuples t1 R1 and t2 R2: there is a foreign key constraint between R1.A and R2.B and t1[a] = t2[b]. Unifiability of data values t1[a] and t2[b]: (i) t1=t2 and both A and B do not belong to a multi-attribute key; (ii) t1 and t2 are joinable and A belongs to a multi-attribute key; (iii) t1 and t2 are joinable, A and B do not belong to a multi-attribute key and there is no other tuple t3 that is joinable with t2. R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014

R2G: schema graph Full Schema Paths: FR.fuser US.uid US.uname FR.fuser FR.fblog BG.bid BG.bname FR.fuser FR.fblog BG.bid BG.admin US.uid US.uname... R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014

R2G: data migration R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014

R2G: query migration R. De Virgilio, A. Maccioni, R. Torlone R2G: a Tool for Migrating Relations to Graphs EDBT International Conference on Extending Database Technology, 2014

scalability over real-world graphs > 500 million users > 1.2 billion active users > 500 million users Graph Databases are hard to scale

real-world graphs 10% of the users follow the same five users Graph Databases are hard to scale power-law graphs preferential attachment scale-free graphs

real-world graphs 10% of the users follow the same five users Graph Databases are very hard to scale Replication Partitioning

real-world graphs A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification

real-world graphs follows fol low s fo l lo ws A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification

real-world graphs A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification

any redundancy? A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification

sparsification SPARSIFICATION high-degree node low-degree node compressor node A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification

compression via sparsification 1 C 2 B 3 A 4 Y D E E GR N O I AT C I SIF R SPA 1 2 4 3 BC C ABC B AB A 5 5 6 6 SPA CEAW ARE SPA RSI FIC ATI ON 1 C B ABC 3 4 6 5 2 A A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification

graph pattern matching with Greedy Compressed Graphs with Space-aware Compressed Graphs A. Maccioni, D. J. Abadi Scalable Pattern Matching over Compressed Graphs via Sparsification

open problems How to shard/partition a graph database? How to visualize large graphs? Specialized startups are addressing the problem Standardization of a query language Graph processing with GPUs Medusa-gpu, MapGraph What is the best way to implements graph layer on top of SQL/NoSQL systems? IBM, Oracle, Teradata, HP,...

conclusion Graphs are used in many fields Graphs are more complicated to manage than other types of data and we need different considerations When we need to store a big graph Social Networks, Bioinformatics, Semantic Web, Geo-informatics,... we have many options, each one with both advantages and disadvantages Scaling queries over graph databases is still an infant area of database industry and research

thanks for the attention