Graph Databases Josep Lluis Larriba Pey David Dominguez Sal




Graph Databases Josep Lluis Larriba Pey David Dominguez Sal Norbert Martinez Bazán www.dama.upc.edu http://sparsity-technologies.com

Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Graph technology and management: Sparksee 5. Basic operations: Community search 6. Parallel and distributed graph management 7. Applications: enriching queries 8. Benchmarking: LDBC 9. Conclusions

DAMA-UPC: university research group. Our goal: research and technology transfer on managing very large data volumes. The beginnings: since 1999, founded inside UPC. The team: 14 researchers and developers. Our software: DEX, massive graph management. Awards: IBM Faculty Awards (2004, 2009), IBM PhD Award (2004), CINC prize for novel entrepreneurs (2009). Publications: > 70 research papers, 4 patents. Funding, agreements and collaborators.

The DAMA-UPC/Sparsity ecosystem. Research at DAMA-UPC: graph management, community detection, parallel graph management, keyword enrichment. European projects: LDBC, CoherentPaaS, Tetracom. Collaboration with Sparsity: EC projects, hiring of talent, support for research and applications. Why Sparsity? To put our technologies in the market, create synergies with other companies, and allow mobile flexibility. What does Sparsity commercialise? Sparksee 5.1 (Linux, Windows, MacOS; C++, Java, Python, .NET), Sparksee mobile 5.1 (Android, iOS, BB10), and applications such as Tweeticer and Daurum. Technology transfer: gather the needs from companies; marketing and sales; customer support. Collaborations with IBM, Oracle, CA Technologies, Media Planning, BMAT, etc.

Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Graph technology and management: Sparksee 5. Basic operations: Community search 6. Parallel and distributed graph management 7. Applications: enriching queries 8. Benchmarking: LDBC 9. Conclusions

Graphs are everywhere!

Graphs are everywhere! Interest in the structural analysis of entity relationships is growing dramatically in many scenarios. The number of applications calling for efficient large graph management increases every day. Handling and querying such large graphs efficiently becomes essential. Graph mining techniques are emerging for analyzing graph-like data and extracting information.

Motivating Queries: Bibliographic Exploration. Authority discovery: Who receives more citations from reputed authors? Who is more central to a topic? Social network of an author. Determination of the life of a topic: how did a topic evolve backward and forward (citations)? Reviewer recommendation: Who knows a topic best? Who is authoritative? Who did some authors collaborate with?

Motivating Queries: Social Network Analysis. Determination of an actor's centrality in a network: degree centrality; closeness centrality (the (normalized) inverse of the average distance metric); betweenness centrality. Determination of the role of an actor in a network: actors playing a particular social role have to be equivalent/similar to each other; structural equivalence (based on neighborhood). Differences with graph mining: graphs are usually smaller, power laws are not observed, efficient algorithms are not an issue.

Motivating queries (cont'd). Root cause analysis in a distributed computing environment: determination of the cause of errors; pattern discovery; spread of errors; prediction of malfunctioning; alerts of possible malfunctioning.

Small, large and huge graphs. Tweets: 177M tweets/day (04/11) x 365 > 64B tweets/year; 500M tweets/day (10/12) x 365 > 190B/year. 20K papers/day on average x 365 > 7M/year (Scopus). Internet users, from 1995 to 2010 to 2013: a growth of over 100x in the number of internet users, from 16M to > 2B to > 2.8B; 29% of the world's population in 2010 and 39% in 2013. 3.8B e-mail users send > 400B e-mails/day. 3B phone calls/day in the USA as of 2010. http://www.radicati.com/wp/wp-content/uploads/2011/05/email-statistics-report-2011-2015-executive-summary.pdf http://en.wikipedia.org/wiki/global_internet_usage http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/

Graphs are everywhere! Two models for graph processing. Transactional-like environments, where queries are person oriented: "who is ...", "social network of ...", Google-like queries with restrictions on the data. Problems: privacy, real-time aggregation of answers from different sources. Graph analytical processing environments, where queries are oriented to provide big structural answers: queries on the whole graph, like "what communities are there in the graph?". Problems: how to distribute the graph, parallel implementations.

What do we need to represent? Labeled: nodes and edges are typed (nodes: authors, papers and keywords; edges: relations and citations). Directed: edges can have a fixed direction (relations bidirectional, citations unidirectional). Attributed: nodes and edges can have multiple single-valued attributes (author nodes have a profile; citation edges have a date). Multigraph: two nodes can be connected by multiple edges (an author may be related to a department at different moments).
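The four requirements above can be captured in a few lines. This is a toy sketch, not Sparksee's actual data structures; labels and attribute names are invented for the example:

```python
# A minimal labeled, directed, attributed multigraph. Every node and edge
# gets its own oid, so parallel edges between the same pair are allowed.
class MultiGraph:
    def __init__(self):
        self.next_oid = 0
        self.labels = {}   # oid -> label (nodes and edges are both typed)
        self.attrs = {}    # oid -> {attribute name: single value}
        self.edges = {}    # edge oid -> (tail oid, head oid); fixed direction

    def add_node(self, label, **attrs):
        oid = self.next_oid; self.next_oid += 1
        self.labels[oid] = label
        self.attrs[oid] = dict(attrs)
        return oid

    def add_edge(self, label, tail, head, **attrs):
        oid = self.next_oid; self.next_oid += 1
        self.labels[oid] = label
        self.attrs[oid] = dict(attrs)
        self.edges[oid] = (tail, head)
        return oid

g = MultiGraph()
a = g.add_node("AUTHOR", name="Alice")
d = g.add_node("DEPARTMENT", name="DAMA")
# Multigraph: the same author related to the same department at two moments.
g.add_edge("WORKS_AT", a, d, since=2004)
g.add_edge("WORKS_AT", a, d, since=2009)
```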

What do we want from graphs? What are the most common queries that we need to perform on graphs? What kind of processes must be carried out to perform these queries? Can we classify these queries? Is there a set of common underlying operations for all these queries?

Example: Usual Operations on Social Networks. Common social network operations: Neighbors, Shortest Path, Degree, K-hops, Connected Components, Closeness, Betweenness, Bridging. [The slides illustrate these on an example network: the out-degree and in-degree of each vertex, and the 1-hop, 2-hop, 3-hop and 4-hop neighborhoods of a vertex.]
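The degree and k-hop operations illustrated here are easy to state over an adjacency-list graph. A small sketch, with an invented four-person network and k-hops done by breadth-first search:

```python
from collections import deque

# Hypothetical network; a directed edge means "follower -> followed".
follows = {
    "ana": ["bob", "eva"],
    "bob": ["eva"],
    "eva": ["ana", "dan"],
    "dan": [],
}

def out_degree(g, v):
    return len(g[v])

def in_degree(g, v):
    return sum(v in targets for targets in g.values())

def k_hops(g, source, k):
    """Vertices reachable in at most k hops from source (BFS), source excluded."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        v, depth = frontier.popleft()
        if depth == k:
            continue
        for w in g[v]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, depth + 1))
    return seen - {source}
```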

Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

The graph processing stack. I would like to determine who could be a good reviewer for a paper. I need to compute, among others, the difference between two social graphs. I need to iterate over nodes, get neighbours through edge navigation, etc.

The stack: tier 1, applications. Example: bibliography setting. Who can be a good reviewer for a paper? Analyse the paper submitted. Find authorities for the most important topics in the paper. Find the social network of the authors and eliminate them from the authorities.

The stack: tier 2, high level ops. High level operations: Analyse the paper submitted: text analysis. Find authorities for the most important topics in the paper: a PageRank-like algorithm if citations are included in the graph. Find the social network of the authors and eliminate them from the authorities: a two-hop algorithm (1st hop, my papers; 2nd hop, my co-authors) and graph subtraction.
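As a toy illustration of this pipeline (all data and the per-topic authority list are invented, and the text-analysis step is skipped), the two-hop co-author step and the final subtraction can be sketched as:

```python
# Hypothetical inputs: topic authorities and a paper/author bibliography.
authorities = {"graphs": ["carol", "dan", "alice"]}  # per-topic, ranked
wrote = {"alice": ["p1", "p2"]}                      # author -> papers
authors_of = {"p1": ["alice", "bob"], "p2": ["alice", "dan"]}

def coauthors(author):
    # 1st hop: my papers; 2nd hop: the authors of those papers.
    return {a for p in wrote.get(author, []) for a in authors_of[p]} - {author}

def candidate_reviewers(topic, submitter):
    # "Graph subtraction" here degenerates to a set difference:
    # remove the submitter's social network from the authorities.
    conflict = coauthors(submitter) | {submitter}
    return [a for a in authorities[topic] if a not in conflict]

print(candidate_reviewers("graphs", "alice"))   # prints ['carol']
```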

The stack: tier 3, low level ops. Basic analysis: get node/edge; get attributes from a node or an edge; get neighbors; node degree. Basic transformations: add/delete node/edge; add/delete/update attribute.

Summary of graph operations D. Dominguez-Sal, N. Martinez-Bazan, V. Muntés-Mulero, P. Baleta, and J-Ll. Larriba-Pey. A discussion on the design of graph database benchmarks. In Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems (TPCTC'10). Springer-Verlag, Berlin, Heidelberg, 25-40.

Initial Conclusions. Graphs must in general be attributed, labeled (typed), directed multigraphs. Operations on graphs: many operations require access to the whole graph; significant sets of cascaded operations; many queries access both the structure and the attributes. Candidate scenarios to benefit from this technology: any scenario with large graph datasets, a variety of operations, and industrial interest.

Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations. Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

Sparksee, high performance technology Norbert Martínez-Bazán, Arnau Prat Pérez, David Dominguez Sal, and Josep Lluis Larriba Pey et al. DAMA-UPC/Sparsity Technologies N. Martínez-Bazan et al: Dex: high-performance exploration on large graphs for information retrieval. CIKM 2007: 573-582 N. Martínez-Bazan et al: Efficient graph management based on bitmap indices. IDEAS 2012: 110-119 R. Angles, A. Prat-Pérez, D. Dominguez-Sal, J.-L. Larriba Pey: Benchmarking database systems for social network applications. GRADES 2013: 15

Classical Graph Representation: adjacency matrix. Good: static, good locality for dense graphs (not common), bit-based representation. Bad: prohibitively large size, poor neighbors performance, costly vertex insertion.

Classical Graph Representation: adjacency list. Good: size for sparse graphs, neighbors performance. Bad: slower "edge exists between two vertices" test, dynamic memory.
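The trade-off between the two representations can be seen on a three-vertex example:

```python
# The same 3-vertex directed graph in both classical representations.
edges = [(0, 1), (0, 2), (2, 1)]
n = 3

# Adjacency matrix: O(n^2) space, O(1) "edge exists?", O(n) neighbor scan.
matrix = [[0] * n for _ in range(n)]
for tail, head in edges:
    matrix[tail][head] = 1

# Adjacency list: O(n + e) space, fast neighbor iteration,
# but "edge exists?" requires scanning the source vertex's list.
adj = {v: [] for v in range(n)}
for tail, head in edges:
    adj[tail].append(head)

assert matrix[0][2] == 1   # constant-time membership test
assert adj[0] == [1, 2]    # direct neighbor enumeration
```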

Classical Graph Representation: several limitations. No labels, no node and edge attributes, no multigraphs. Good for in-memory processing only.

Relational Database Implementation. Store each node type in a table with its associated attributes; store each edge type in a table with its associated attributes. Navigation through edges is resolved by join operations. Query language: SQL. Not suitable for path traversals and graph exploration. E.g.: DB2, Oracle, SQL Server, MySQL, Postgres.
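A minimal sketch of this encoding, using an in-memory SQLite database with invented table names, shows why traversals turn into chains of joins:

```python
import sqlite3

# One table per node type, one per edge type; following an edge is a join.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cites  (src INTEGER, dst INTEGER);
""")
db.executemany("INSERT INTO author VALUES (?, ?)",
               [(1, "Alice"), (2, "Bob"), (3, "Carol")])
db.executemany("INSERT INTO cites VALUES (?, ?)", [(1, 2), (2, 3)])

# Each extra hop costs another self-join: a k-hop traversal needs k joins,
# which is why long path queries strain a relational engine.
two_hops = db.execute("""
    SELECT a.name
    FROM cites c1
    JOIN cites c2 ON c1.dst = c2.src
    JOIN author a ON a.id = c2.dst
    WHERE c1.src = 1
""").fetchall()
assert two_hops == [("Carol",)]
```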

Resource Description Framework (RDF). Triple format: subject, predicate, object. All data is represented as triples; queries are patterns on such triples. Edge: a triple where subject and object are nodes. Attribute: a triple where the object is a literal value. Query language: SPARQL. Not a natural representation. E.g.: Virtuoso, RDF-3X, Sesame; RDBMSs also start to support RDF (DB2, Oracle).

New approaches. Graph databases: graph/edge types; specialized graph APIs or languages (e.g. Cypher, Gremlin); out-of-core functionality with a buffer pool. E.g. Sparksee, Neo4j, OrientDB, HyperGraphDB, InfiniteGraph. Distributed graph analysis for large datasets: map-reduce (Pegasus); vertex-centric computation model (Pregel, Giraph, GraphLab).

Requirements Data and schema represented as a graph Data operations based on graph operations Graph-based integrity restrictions Multigraphs Attributes attached to both vertices and edges Graph queries combining edge traversals with attribute accesses Diversity of workloads Efficient secondary memory management

Sparksee Main Features. Graph split into small structures: move to main memory just the significant parts (caching). Object identifiers (oids) instead of complex objects: reduced memory requirements. Specific structures to improve traversals: index the edges and the neighbors of each node. Attribute indices: improve queries based on value filters. Implemented in C++; different APIs (Java, .NET, etc.) through wrappers.

Sparksee Capabilities. Efficiency: very compact representation using bitmaps; highly compressible data structures. Capacity: more than 100 billion vertices and edges in a single multicore computer. Performance: sub-second response in recommendation queries. Scalability: high throughput for concurrent queries. Consistency: partial transactional support with recovery. Multiplatform: Linux, Windows, MacOSX, mobile.

Sparksee Architecture. [Architecture diagram; components include SparkseePython.]

Graph representation. We define a graph G = (V, E, L, T, H, A1, ..., Ap) as:
LABELS: L = {(o, l) | o ∈ (V ∪ E) ∧ l ∈ string}
TAILS: T = {(e, t) | e ∈ E ∧ t ∈ V}
HEADS: H = {(e, h) | e ∈ E ∧ h ∈ V}
ATTRIBUTES: Ai = {(o, c) | o ∈ (V ∪ E) ∧ c ∈ {int, string, ...}}
With this representation, the graph is split into multiple lists of pairs, and the first element of each pair is always a vertex or an edge.

Graph representation L (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS),(e6, CONTAINS), (e7, CONTAINS) T (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4) H (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6) A_id (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2) A_title (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona) A_nlc (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en) A_filename (v5, europe.png), (v6, bcn.jpg) A_tag (e4, continent)

Value sets. Group all pairs of the original set with the same value into a single pair between the value and the set of objects with that value:
L: (ARTICLE, {v1, v2, v3, v4}), (BABEL, {e1, e2}), (CONTAINS, {e5, e6, e7}), (IMAGE, {v5, v6}), (REF, {e3, e4})
T: (v1, {e1}), (v2, {e2}), (v3, {e5, e6}), (v4, {e3, e4, e7})
H: (v3, {e1, e2, e3, e4}), (v5, {e5}), (v6, {e6, e7})
A_id: (1, {v1, v5}), (2, {v2, v6}), (3, {v3}), (4, {v4})
A_title: (Barcelona, {v4}), (Europa, {v1}), (Europe, {v2, v3})
A_nlc: (ca, {v1}), (en, {v3, v4, e1, e2}), (fr, {v2})
A_filename: (bcn.jpg, {v6}), (europe.png, {v5})
A_tag: (continent, {e4})

Bitmap Index. Each vertex and edge is identified by a unique and immutable oid (object identifier). Each vertex or edge set is stored in a bitmap structure: each position in the bitmap corresponds to the oid of an object. Reduced amount of space (compression techniques); very efficient binary logic operations. E.g.: (ARTICLE, {v1, v2, v3, v4}) -> Article: 111100.
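A sketch of the idea, using Python integers as bitmaps (the real structure adds compression and clustering):

```python
# One bit per oid: the label set ARTICLE = {v1, v2, v3, v4} over oids 1..6
# becomes the bit pattern 111100 from the slide.
def bitmap(oids):
    bits = 0
    for oid in oids:
        bits |= 1 << oid
    return bits

def members(bits):
    """Recover the set of oids whose bit is set."""
    return {i for i in range(bits.bit_length()) if bits >> i & 1}

article = bitmap({1, 2, 3, 4})   # vertices v1..v4
english = bitmap({3, 4})         # vertices whose language attribute is "en"

# Set intersection is a single bitwise AND: the "very efficient binary
# logic operations" the slide refers to.
english_articles = members(article & english)
assert english_articles == {3, 4}
```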

Link: the basic internal structure. A new way to represent a graph that allows out-of-core management: the graph is split into smaller structures to favor the caching of significant parts; object identifiers (oids) are used to reduce memory requirements; specific structures help in the navigation and traversal of edges; attributes are fully indexed to allow queries based on filters. [Diagram: a Link combines bitmaps of oids with maps from oids to values.]

Example of a bitmap based representation

Integrity rules

Value set operations. Domain: returns the set of distinct values. Objects: returns the set of vertices or edges associated to a value. Lookup: returns the set of values associated to a set of objects. Insert/Remove: adds or removes a vertex or edge in the collection.

Graph query examples.
Number of articles: |objects(LABELS, "ARTICLE")|
Out-degree of the English article "Europe": |objects(TAILS, objects(TITLE, "Europe") ∩ objects(NLC, "en") ∩ objects(LABELS, "ARTICLE"))|
Articles with references to the image with filename "bcn.jpg": {lookup(TAILS, x) | x ∈ objects(HEADS, objects(FILENAME, "bcn.jpg") ∩ objects(LABELS, "IMAGE"))}
Count the articles of each language: {(x, y) | x ∈ domain(NLC) ∧ y = |objects(NLC, x) ∩ objects(LABELS, "ARTICLE")|}
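A sketch (not Sparksee's implementation) that replays the last query, "count the articles of each language", with value sets stored as plain dicts of sets:

```python
# Value sets from the running example: label -> objects, language -> objects.
LABELS = {"ARTICLE": {1, 2, 3, 4}, "IMAGE": {5, 6}}
NLC = {"ca": {1}, "en": {3, 4}, "fr": {2}}

def domain(value_set):
    """Set of distinct values."""
    return set(value_set)

def objects(value_set, value):
    """Objects associated to a value (empty set if the value is absent)."""
    return value_set.get(value, set())

# {(x, y) | x in domain(NLC), y = |objects(NLC, x) & objects(LABELS, ARTICLE)|}
counts = {x: len(objects(NLC, x) & objects(LABELS, "ARTICLE"))
          for x in domain(NLC)}
assert counts == {"ca": 1, "en": 2, "fr": 1}
```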

Implementation details. Bitmaps are compressed by grouping the bits into clusters of 32 consecutive bits (up to 137 billion objects per graph). Locality is improved by generating consecutive oids for each distinct vertex or edge label. A sorted tree structure of bitmap clusters speeds up insert, remove, and binary logic operations. Maps are implemented using B+ trees. The tail, head and attribute value sets are split into specific value sets for each label.

Evaluation (GRADES '13). Objective: understand how small queries stress different DBs: graph, RDF, relational. Data schema based on social network data: Person (ID, name, location, age) and WebPage (ID, URL, creation) nodes, with Friend (Person-Person) and Like (Person-WebPage) edges. Graph data generator based on R-MAT. In collaboration with R. Angles, U. Talca.

Query set.
Selection: (Q1) Get all the persons having a name N. (Q4) Get the name of the person with a given PID.
Adjacency: (Q2) Get all the persons who like a given webpage W. (Q3) Get the webpages that person P likes.
Pattern matching: (Q10) Get the common friends of persons P1 and P2. (Q11) Get the common webpages that persons P1 and P2 like.
Fixed-length path: (Q5) Get the friends of the friends of a given person P. (Q6) Get the webpages liked by the friends of a given person P. (Q7) Get the persons that like a webpage which a person P likes.
Reachability: (Q8) Is there a friend connection between two persons? (Q9) Get the shortest path between two persons.
Summarization: (Q12) Get the number of friends of a person P.

Experiments: systems.
DB system | DB type | Implementation | Query language
Dex (v4.7) | Graph | API | -
Neo4j (v1.8.2) | Graph | API | Cypher
RDF-3X (v0.3.7) | RDF | Java driver | SPARQL
Virtuoso (v7.0) | RDF / column store | Java driver, stored procedures | SQL + extensions (Virtuoso/PL)
PostgreSQL (v9.1) | Row based | Java driver, stored procedures | SQL, PL/PgSQL
Test environment. Hardware: Intel Xeon E5530 2.4 GHz, 32 GB RAM, 1 TB HD. Software: Linux Debian 2.6.32-5-amd64 kernel, ext3 file system.

Experiments: query execution test

Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

On Demand Memory Specialization for Distributed Graph Databases Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep-Lluis Larriba-Pey Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep- Lluis Larriba-Pey: On Demand Memory Specialization for Distributed Graph Databases. CoRR abs/1310.4802 (2013)

Motivation. Large graphs need to be stored and processed in parallel/distributed computers, but are difficult to partition. Partition functions favour specific operations, have to be executed anew for different environments, and are costly to execute.

Contributions. System design in two levels: physical storage and memory management. Data access pattern monitoring with a specific data structure. Load and network balancing. Increased throughput.

System Overview. [Diagram: two levels, memory management on top of physical storage.]

Partition Manager. We propose a new data structure that monitors data access patterns and uses this information in a simple way to decide how to route queries: a matrix of data access sequences, stored in a new compressed data structure.
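A toy sketch of the routing idea (the real system uses a compressed matrix of access sequences; here a plain counter per vertex stands in, and all names are invented):

```python
from collections import Counter

# vertex -> Counter of partitions that recently served it
access = {}

def record(vertex, partition):
    """Monitor: note that this partition touched this vertex."""
    access.setdefault(vertex, Counter())[partition] += 1

def route(vertex, default=0):
    """Route the next query on a vertex to the partition that has
    accessed it most often; fall back to a default partition."""
    counts = access.get(vertex)
    return counts.most_common(1)[0][0] if counts else default

record("v1", 2)
record("v1", 2)
record("v1", 0)
assert route("v1") == 2   # partition 2 served v1 most often
assert route("v9") == 0   # unseen vertex goes to the default
```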

Experiments. Scalability with cluster size: tested up to 32 machines. Systems compared: static partitioning vs. dynamic partitioning (ours). R-MAT graph with 37M vertices and 1B edges. Queries: BFS and k-hops.

Experiments: BFS. Throughput (higher is better). Load imbalance (lower is better).

Experiments

Experiments. Average response time (lower is better).

Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

Shaping Communities out of Triangles Arnau Prat Pérez, David Dominguez Sal, Josep Maria Brunat and Josep Lluis Larriba Pey DAMA-UPC A. Prat-Pérez, D. Dominguez-Sal, J.L. Larriba-Pey: High quality, scalable and parallel community detection for large real graphs. WWW 2014: 225-236 A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, J.L. Larriba-Pey: Shaping communities out of triangles. CIKM 2012: 1677-1681

Community detection. Community, informal definition: a set of nodes well connected among themselves, but not with other nodes. There is no agreement on a formal definition.

State of the art: community detection. In general, state-of-the-art metrics perform well (modularity, conductance). However, under certain circumstances they fail: tree-like structures, clique chains. Algorithms based on those metrics fail to provide quality and performance. Reason: they focus on edges while ignoring the internal structure.

Our goal. Find algorithms with a strong focus on quality, scalability and parallelism. First, propose a metric: Weighted Community Clustering (WCC). Then, propose an algorithm: Scalable Community Detection (SCD), the first proposal for a scalable disjoint community detection algorithm for SMP architectures. Setting: undirected graphs without attributes.

Weighted Community Clustering We take triangles as the key factor of a community structure.

Weighted Community Clustering. t(x, S) is the number of triangles that vertex x closes with vertices in a set S. vt(x, S) is the number of vertices of a set S that form at least one triangle with x. The WCC of a community is the average WCC of its vertices.
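On a small toy graph, the two counts in the definition can be computed directly (this is only an illustration of t and vt, not the SCD algorithm):

```python
# Undirected toy graph as an adjacency-set dict.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def t(x, s, adj):
    """Triangles vertex x closes with vertices of S: pairs of
    neighbors of x inside S that are themselves connected."""
    nbrs = [v for v in adj[x] if v in s]
    return sum(1 for i, u in enumerate(nbrs)
                 for w in nbrs[i + 1:] if w in adj[u])

def vt(x, s, adj):
    """Vertices of S that form at least one triangle with x."""
    nbrs = {v for v in adj[x] if v in s}
    return sum(1 for u in nbrs if adj[u] & nbrs)

s = {1, 2, 3, 4}
assert t(1, s, adj) == 2    # triangles (1,2,3) and (1,3,4)
assert vt(1, s, adj) == 3   # vertices 2, 3 and 4
```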

Properties. Maximizing WCC fulfils a set of minimum properties: internal structure (triangles); linear community cohesion (number of connections of a node); bridges (no bridges secured); cut vertex density (minimum density secured).

Algorithm overview

Experimental setup. Real graphs with ground-truth communities. Average F1-score and NMI to measure quality. Baselines with disjoint and overlapping algorithms: Walktrap, Infomap, Louvain, BigClam and OSLOM. Intel Xeon quad core 2.4 GHz with 32 GB of RAM.
Graph: nodes / edges
Amazon: 0.3 M / 0.9 M
dblp: 0.3 M / 1.0 M
Youtube: 1.1 M / 2.9 M
LiveJournal: 3.9 M / 34.6 M
Orkut: 3 M / 117.1 M
Friendster: 65 M / 1806 M

Execution Time All executions are single threaded. Executions longer than one week were aborted.

F1Score

Complexity. The Friendster graph is processed in 4.3 hours on an Intel Xeon quad core 2.4 GHz with 32 GB of RAM.

Parallelism. Parallelization with OpenMP.

Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey: Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval. ICMR 2014: 33

Introduction. Retrieval quality depends on the user's skills: short keyword-based queries; not an exact, correct and complete set of keywords.

Query Expansion. The process of transforming the original query Q_O into an expanded query Q_E. Detect expansion features (terms/phrases). What kind of expansion features? How to obtain the expansion features?

Overview

Path detection. θ = colored Volkswagen beetles. κ = volkswagen beetles in any color, for example red, blue, green or yellow.
Volkswagen:
volkswagen beetle - volkswagen fox
volkswagen beetle - volkswagen passat
volkswagen beetle - volkswagen type 2
volkswagen beetle - volkswagen golf
volkswagen beetle - volkswagen jetta
volkswagen beetle - volkswagen touareg
volkswagen beetle - volkswagen golf mk4
volkswagen beetle - volkswagen transporter

Build communities. Build communities around paths, using the paths as seeds. Communities retrieve the concepts most closely related to the path, which may not be directly connected to the concept.
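A toy sketch of the expansion step. The knowledge-base fragment and the terms are invented, and the community built around the path is replaced here by a plain k-hop neighborhood:

```python
# Invented knowledge-base graph: term -> related terms.
kb = {
    "volkswagen beetle": {"volkswagen fox", "volkswagen golf", "baja bug"},
    "baja bug": {"volkswagen beetle", "dune buggy"},
}

def expand(query_terms, kb, hops=1):
    """Add to the query the terms within `hops` steps of it in the KB."""
    expansion = set(query_terms)
    frontier = set(query_terms)
    for _ in range(hops):
        frontier = {n for t in frontier for n in kb.get(t, set())}
        expansion |= frontier
    return expansion

q = expand({"volkswagen beetle"}, kb)
assert "baja bug" in q and "volkswagen golf" in q
```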

Example. Query: "Colored volkswagen beetles". From topological expansion: Volkswagen beetle, Volkswagen fusca, VW type 1, Volkswagen 1200, Volkswagen bug, Volkswagen super bug, VW bug, VW Käfer, Volkswagen new Beetle, Baja bug, Volkswagen group, VW group, H-Shifter, engine, car, automobile, ... From redirect-based expansion: VW Beetles, VW Beetle, VW Käfer, Volkswagon Beetles, Volkswagon Beetle, Volkswagon Käfer, Volkswagen Beetles, Volkswagen Beetle, Volkswagen Käfer. Hundreds of expansion terms.

Results: retrieval results for the query "Colored volkswagen beetles", with and without enrichment. [figure]

Outline
1. About us
2. Introduction to graphs
3. The graph processing stack
4. Low level graph operations. Sparksee: graph technology and management
5. High level graph operations. Parallel and distributed graph management
6. Domain specific high level operations. Community search
7. Domain specific operations. Query enrichment
8. Benchmarking: LDBC
9. Conclusions

LDBC Social Network Benchmark

Why benchmarking?
Two main objectives:
- Allow final users to assess the performance of the software they want to buy
- Push the technology to the limit to allow for progress
The main effort in DB benchmarking up to now has been the TPC (Transaction Processing Performance Council), covering relational DBs: transactional and DSS workloads.

What is LDBC?
A non-profit organization formed in 2012 out of a European Union project. It defines and develops benchmarks for graph-like data management technologies:
- Graph Database Management Systems (GDBMS)
- RDF systems
- Graph processing frameworks
LDBC provides:
- All the documentation (workload definitions, running instructions, disclosure guidelines) of the benchmarks it develops
- All the necessary data and software to run the benchmarks

What is LDBC?
Participated in by the principal actors in graph data management and RDF:

Objectives:
- Benchmarks for the emerging field of RDF and graph database management systems (GDBs)
- Spur industry cooperation around benchmarks
The LDBC foundation was created and operative in Q2 2014. Benchmarks created:
- Semantic Publishing Benchmark (SPB)
- Social Network Benchmark (SNB)
Web site: www.ldbcouncil.org. Software repositories on GitHub.

What is LDBC?
Currently developing two benchmarks:
- Social Network Benchmark (LDBC-SNB): tests the performance of graph databases and graph processing frameworks, inspired by the management of a social network
- Semantic Publishing Benchmark (LDBC-SPB): tests the performance of RDF engines, inspired by the media/publishing industry
For more information, visit: www.ldbcouncil.org

The Social Network Benchmark (LDBC-SNB) [1]
Philosophy: rich coverage, modularity, small implementation cost, relevance, reproducibility. Open source: you may do whatever you want with it!
Mimics the operation of a real social network:
- Simple to understand
- Allows testing a complete range of interesting challenges
- Can be easily scaled
[1] https://github.com/ldbc/ldbc_snb_docs

The SNB workloads
Interactive: interactive queries representing the interaction of the users with the social network. Low latency and multiple concurrent users; small data accessed; 14 read and 8 update queries.
Business Intelligence: complex structured queries for analyzing the online behavior of users for marketing purposes. Non-interactive queries; moderate data accessed; 22 queries.
Graph Analytics: expensive graph analytical queries (PageRank, centrality, clustering, ...). Large data accessed; not defined yet, being designed.

LDBC-SNB

DATA SCHEMA

INTERACTIVE
Consists of 14 read queries and 8 update queries.
Read Query 1: Friends with a certain name
Read Query 2: Recent posts and comments by your Friends
Read Query 3: Friends and Friends of Friends that have been in countries X and Y
Read Query 4: New topics
Read Query 5: New groups
Read Query 6: Tag co-occurrence
Read Query 7: Recent likes

INTERACTIVE
Read Query 8: Recent replies
Read Query 9: Recent posts and comments by Friends and Friends of your Friends
Read Query 10: Friend recommendation
Read Query 11: Job referral
Read Query 12: Expert search
Read Query 13: Single shortest path
Read Query 14: Weighted paths
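To make the path queries concrete, the sketch below computes an unweighted single shortest path by BFS, the kind of operation Read Query 13 exercises. The friendship dict and the names in it are hypothetical toy data, not part of the benchmark.

```python
from collections import deque

def shortest_path(adj, src, dst):
    """Single shortest path over an unweighted friendship graph (BFS).

    adj: adjacency dict {person: iterable of friends} (toy data).
    Returns the path as a list of nodes, or None if dst is unreachable.
    """
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

friends = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
print(shortest_path(friends, "alice", "carol"))  # ['alice', 'bob', 'carol']
```

Read Query 14 (weighted paths) would replace the BFS queue with a priority queue, Dijkstra-style.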

INTERACTIVE
Update Query 1: Add Person
Update Query 2: Add Friendship
Update Query 3: Add Forum
Update Query 4: Add Forum Membership
Update Query 5: Add Post
Update Query 6: Add Like to Post
Update Query 7: Add Comment
Update Query 8: Add Like to Comment

BUSINESS INTELLIGENCE
Consists of 22 read queries.
Query 1: Post stats
Query 2: Volume in forums on a subject in a topic
Query 3: The locally going thing
Query 4: Thread length distribution
Query 5: Best publicist
Query 6: Branding hour
Query 7: Market share
Query 8: Cross-border conversation

BUSINESS INTELLIGENCE
Query 9: Person with most posts in a foreign language
Query 10: People who studied in the same university
Query 11: Teenagers talking to strangers
Query 12: Liker clique
Query 13: Data quality
Query 14: Find unusual values
Query 15: Development of mindshare
Query 16: Biggest posters on a tag

BUSINESS INTELLIGENCE
Query 17: Country bias
Query 18: Posting activity
Query 19: Hub countries
Query 20: Moving predicates
Query 21: Mole hunt
Query 22: Concept promoter

DATAGEN
Datasets are synthetically generated using DATAGEN [1]: realistic, scalable, deterministic, usable.
Datasets simulate a social network's activity during a period of time.
Uses Hadoop (MapReduce) to scale to millions of entities.
[1] https://github.com/ldbc/ldbc_snb_datagen

DATAGEN
- 16 dictionaries extracted from DBpedia are used to produce correlated attributes, e.g. names by country, tags by country, companies by country, etc.
- Reproduces realistic distributions found in real social networks, e.g. the friendship degree distribution mimics that found in Facebook.
- Reproduces the homophily principle, a.k.a. similar people tend to be connected.
- Generates substitution parameters for the queries of the workloads.
- Implemented with Hadoop to generate huge datasets.
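As a minimal sketch of "realistic and deterministic" generation, the snippet below draws friendship degrees from a truncated power law under a fixed seed. The distribution, its exponent, and the cap are illustrative assumptions, not DATAGEN's actual fitted Facebook curve; the point is that a seeded generator makes the dataset reproducible.

```python
import random

def sample_degrees(n_persons, alpha=2.5, max_degree=5000, seed=42):
    """Draw friendship degrees from a truncated power law.

    alpha, max_degree and seed are hypothetical parameters; DATAGEN
    instead fits the empirical Facebook degree distribution. The
    fixed seed keeps generation deterministic across runs.
    """
    rng = random.Random(seed)
    # P(degree = k) proportional to k^-alpha, for k in [1, max_degree].
    weights = [k ** -alpha for k in range(1, max_degree + 1)]
    return rng.choices(range(1, max_degree + 1), weights=weights, k=n_persons)

degrees = sample_degrees(1000)
print(min(degrees), max(degrees))
```

Running the same call twice returns identical samples, which is what lets every benchmark implementer work against the same dataset.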

SCALE FACTORS
We define a set of scale factors: SF1, SF3, SF10, SF1000.
Different scale factors target systems of different sizes and characteristics, from single-node machines to large clusters. Each scale factor represents a data set of a different size, in gigabytes:

SF      Persons    Activity    Size
SF1     11,000     3 years     1 GB
SF10    73,000     3 years     10 GB
SF100   499,000    3 years     100 GB

DATAGEN degree distribution: Datagen SF1 vs. Facebook [1]. [figure]
[1] https://www.facebook.com/notes/facebook-data-team/anatomy-of-

LDBC Execution Driver
We provide the SNB execution driver [1]: easy to extend, small implementation cost, multiple concurrent threads.
- Each query type is assigned an interleave interval: the time between two query instances of the same type being issued.
- Queries vary in complexity, so we must ensure no single one dominates execution time; interleave intervals are extracted experimentally.
- Tracks dependencies between update queries automatically: e.g. we cannot insert a post before its author is inserted.
- Automatically gathers and reports performance metrics.
[1] https://github.com/ldbc/ldbc_driver
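The interleave mechanism above can be sketched as merging per-query-type streams into one global issue order. This is a simplified single-threaded model of the idea, not the actual driver; the query-type names and interval values are hypothetical.

```python
import heapq

def build_schedule(interleaves, horizon):
    """Merge per-query-type streams into one global issue order.

    interleaves: {query_type: interval in ms}. Each type is issued
    every `interval` ms, so cheap queries run more often than
    expensive ones and no single type dominates the run.
    Returns (timestamp, query_type) pairs up to `horizon` ms.
    """
    heap = [(interval, qtype, interval) for qtype, interval in interleaves.items()]
    heapq.heapify(heap)
    schedule = []
    while heap:
        ts, qtype, interval = heapq.heappop(heap)
        if ts > horizon:
            continue  # this stream has run past the horizon
        schedule.append((ts, qtype))
        heapq.heappush(heap, (ts + interval, qtype, interval))
    return schedule

print(build_schedule({"read_1": 30, "read_13": 100}, horizon=100))
```

The real driver additionally respects inter-update dependencies (a post is never scheduled before its author's insertion timestamp) and runs the streams from multiple concurrent threads.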

Conclusions
LDBC defines and develops benchmarks for graph-like data management: relevant and easy to adopt.
Two benchmarks:
- SNB, with three different workloads for different types of graph management systems
- SPB, for RDF engines
LDBC-SNB provides a realistic social network data generator and an execution driver. All software is open source.

Outline
1. About us
2. Introduction to graphs
3. The graph processing stack
4. Low level graph operations. Sparksee: graph technology and management
5. High level graph operations. Parallel and distributed graph management
6. Domain specific high level operations. Community search
7. Domain specific operations. Query enrichment
8. Benchmarking: LDBC
9. Conclusions

Conclusions
Graph management is becoming important:
- Different applications where traversals, pattern matching and complex graph operations are needed
- Performance is an issue
Building high performance systems:
- Sparksee, a vertical store based on bitmap processing
- Distributed management of graphs
- Community search: important to take structure into account
LDBC provides:
- Benchmarks relevant for industry, in two flavors: SPB and SNB
- Open to everybody to use, explore, and propose new ideas in benchmarking

Any questions? Contact e-mail: larri@ac.upc.edu DAMA Group Web Site: http://www.dama.upc.edu

Sparksee evaluation (IDEAS) Additional material

Commercial Graph DBMSs

DEX: Some internal details
All the structures presented have been implemented in the current version of the DEX core engine:
- Support for 37-bit unsigned integer nids and eids, i.e. more than 137 billion objects per graph.
- Identifiers clustered in groups for each node or edge type.
- Bitmaps are compressed by grouping the bits into clusters of 32 consecutive bits; only clusters with at least one bit set are stored.
- Maps are implemented using B+ trees.
N. Martínez-Bazán, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M. Sánchez-Martínez, and J. Ll. Larriba-Pey. DEX: High Performance Exploration on Large Graphs for Information Retrieval. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM), Lisbon, pages 573-582, 2007.
N. Martínez-Bazán, S. Gómez-Villamor, V. Muntés-Mulero and J. Ll. Larriba-Pey. Procedure to represent and manipulate multigraphs based on the use of bitmaps. Spanish Patent and Trademark Office, Patent Pending #40444, 22/July/2008.
N. Martínez-Bazán. DEX: High-Performance Graph Databases. Master Thesis in Computer Architecture and Network Systems, Universitat Politècnica de Catalunya, Barcelona, September/2008.
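The bitmap scheme described above (32-bit clusters, only non-empty clusters stored) can be sketched as a sparse map from cluster index to a 32-bit word. This is a toy model of the idea only; the real engine has its own storage layout.

```python
class ClusteredBitmap:
    """Sparse bitmap sketch: bits are grouped into clusters of 32
    consecutive positions, and only clusters with at least one bit
    set are stored (here in a dict keyed by cluster index)."""

    def __init__(self):
        self.clusters = {}  # cluster index -> 32-bit word

    def set(self, oid):
        idx, bit = divmod(oid, 32)
        self.clusters[idx] = self.clusters.get(idx, 0) | (1 << bit)

    def test(self, oid):
        idx, bit = divmod(oid, 32)
        return bool(self.clusters.get(idx, 0) & (1 << bit))

bm = ClusteredBitmap()
bm.set(5)
bm.set(1_000_000)
print(bm.test(5), bm.test(6), len(bm.clusters))  # True False 2
```

Two set bits a million positions apart cost only two stored words, which is why (as shown later) the vast majority of bitmaps stay tiny.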

Performance and Memory Usage
Benchmark: Wikipedia from 254 different languages with 57 million articles, 2.1 million images, more than 321 million links, and 483 million attribute values (17 GB).

Q1 large SPT (BFS): find the article with the largest outdegree and build a shortest-path tree (SPT) considering only the edges of type REF. Time 118.93 s; traversals 236,387,207; bitmaps 357.63 MB; memory 832.19 MB.
Q2 top-k 2-hop: recommend related articles to the most popular one. Time 205.97 s; traversals 261,735,954; bitmaps 1,445.31 MB; memory 2,974.50 MB.
Q3 pattern matching: find new images for articles in other languages. Time 10.68 s; traversals 1,536,698; bitmaps 154.00 MB; memory 320.81 MB.
Q4 group by & count: find, for each different language, the number of articles and the number of images referenced by those articles, without repetition. Time 146.77 s; traversals 4,987,879; bitmaps 190.38 MB; memory 245.13 MB.
Q5 update 2.1M: for each article, materialize an attribute indicating the number of images contained. Time 141.06 s; traversals 5,934,724; bitmaps 257.25 MB; memory 319.00 MB.
Q6 delete ~90%: remove all the articles without any image (only if it contains more than one image). Time 7,518.06 s; traversals 281,433,106; bitmaps 11,583.88 MB; memory 17,093.63 MB.

Experiments performed on a computer with two quad-core Intel(R) Xeon(R) E5440 at 2.83 GHz, 6144 KB second-level cache, 64 GB main memory, and a 1.7 TB disk. The operating system is Linux Debian Etch 4.0.

Analysis of the Distribution of Bitmaps
99.97% of the bitmaps are smaller than 250 bytes and occupy only 32% of the data.

Analysis of Bitmap Usage

Comparison with Other Approaches
Query times in seconds.

                       In-memory   MonetDB   Neo4j       DEX
Data (GB)              15.09       12.00     82.00       16.98
Load time (hours)      0.32        0.74      8.22        2.25
Q1 large SPT (BFS)     12.97       106.14    32230.00    118.93
Q2 top-k 2-hop         51.86       120.50    24832.00    205.97
Q3 pattern matching    6.28        7.56      2045.00     10.68
Q4 group by & count    31.65       84.97     34882.00    146.77
Q5 update 2.1M         176.15      48.34     32539.00    141.06
Q6 delete ~90%         965.56      760.85    > 1 week    7518.06

Scalability

Scalability

SPB Benchmark - Sparksee on RDF Additional material

Schema

Choke points
CP1: Join ordering
CP2: Aggregation
CP3: Optional and nested optional clauses
CP4: Reasoning
CP5: Parallel execution of unions
CP6: Optional with filters
CP7: Ordering
CP8: Geo-spatial predicates
CP9: Full text search
CP10: Duplicate elimination
CP11: Complex filter conditions

Queries (I)

Queries (II)

Characteristics of SPB queries

Results