Graph Databases Josep Lluis Larriba Pey David Dominguez Sal

1 Graph Databases Josep Lluis Larriba Pey David Dominguez Sal Norbert Martinez Bazán

2 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Graph technology and management: Sparksee 5. Basic operations: Community search 6. Parallel and distributed graph management 7. Applications: enriching queries 8. Benchmarking: LDBC 9. Conclusions

3 DAMA-UPC: university research group. Our goal: research and technology transfer on managing very large data volumes. The beginnings: since 1999, founded inside UPC. The team: 14 researchers and developers. Our software: DEX, massive graph management. Awards: IBM Faculty Awards (2004, 2009), IBM PhD Award (2004), CINC prize for novel entrepreneurs (2009). Publications: > 70 research papers, 4 patents. Funding, agreements and collaborators.

4 The DAMA-UPC/Sparsity ecosystem. DAMA-UPC (research): graph management, community detection, parallel graph management, keyword enrichment; European projects: LDBC, CoherentPaaS, Tetracom; help to apps and technology development. Sparsity Technologies (applications and technology development): graph management (DEX, the original GDB); social analytics (Tweeticer); social media; new mobile apps. Collaboration with Sparsity: support for research, support to apps, technology transfer. Why Sparsity? A good vehicle to: put our technologies in the market; create synergies with other companies; allow flexibility with DAMA-UPC (EC projects, hiring of talent). What does Sparsity commercialise? Sparksee 5.1 (Linux, Windows, MacOS; C++, Java, Python, .Net); Sparksee 5.1 mobile (Android, iOS, BB10); Tweeticer; Daurum. How is Sparsity organised? Technology transfer (gather the needs from companies), research, marketing and sales, customer support. Collaborations with: IBM, Oracle, CA Technologies, Media Planning, BMAT, etc.

6 Graphs are everywhere!

7 Graphs are everywhere! Interest in the structural analysis of entity relationships is growing dramatically in many scenarios. The number of applications calling for efficient large graph management is increasing every day. Handling and querying such large graphs efficiently becomes essential. Graph mining techniques are emerging for analyzing graph-like data and extracting information.

8 Motivating Queries: Bibliographic Exploration. Authority discovery: Who receives more citations from reputed authors? Who is more central to a topic? Social network of an author. Determination of the life of a topic: How did a topic evolve backward and forward (citations)? Reviewer recommendation: Who knows a topic better? Who is authoritative? Who did some authors collaborate with?

9 Motivating Queries: Social Network Analysis. Determination of an actor's centrality in a network: degree centrality, closeness centrality ((normalized) inverse of the average distance metric), betweenness centrality. Determination of the role of an actor in a network: actors playing a particular social role have to be equivalent/similar to each other; structural equivalence (based on neighborhood). Differences with graph mining: graphs are usually smaller; power laws are not observed; efficient algorithms are not an issue.
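The two simplest centrality measures named on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the toy network, the function names, and the normalization choices (dividing by the number of other nodes) are assumptions.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj, node):
    # Fraction of the other nodes that are direct neighbors of this node.
    return len(adj[node]) / (len(adj) - 1)

def closeness_centrality(adj, node):
    # Inverse of the average distance to the other reachable nodes.
    dist = bfs_distances(adj, node)
    total = sum(d for n, d in dist.items() if n != node)
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical undirected toy network: a path a - b - c - d.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(degree_centrality(toy, "b"), closeness_centrality(toy, "b"))
```

Betweenness centrality needs all-pairs shortest paths and is left out of the sketch.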

10 Motivating queries (cont'd): Root cause analysis. Distributed computing environment. Determination of the cause of errors. Pattern discovery. Spread of errors. Prediction of malfunctioning. Alerts of possible malfunctioning.

11 Small, large and huge graphs. Tweets: 177M tweets/day (04/11) x 365 > 64B tweets/year; 500M tweets/day (10/12) x 365 > 190B/year. 20K average papers/day x 365 > 7M/year, Scopus. Internet users: from 1995 to 2010 to 2013, a growth of +100 times the number of internet users, 16M to > 2B to > 2.8B, 29% of the world's population in 2010 and 39% in B users send >400B s/day 3B phone calls/day in the USA as

12 Graphs are everywhere! Two models for graph processing. Transactional-like environments, where queries are oriented to "who is...", "social network of...": restrictions on the data; Google-like queries, person-oriented queries. Problems: privacy, real-time aggregation of answers from different sources. Graph analytical processing environments, where queries are oriented to provide big structural answers: queries on the whole graph, like "what communities are there in the graph?". Problems: how to distribute the graph, parallel implementations.

13 What do we need to represent? Labeled: nodes and edges are typed. Nodes: authors, papers and keywords. Edges: relations and citations. Directed: edges can have a fixed direction. Relations are bidirectional, citations unidirectional. Attributed: nodes and edges can have multiple single-valued attributes. Author nodes have a profile. Citation edges have a date. Multigraph: two nodes can be connected by multiple edges. An author may be related to a department at different moments.
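The four requirements on this slide (labeled, directed, attributed, multigraph) can be made concrete with a small in-memory structure. The sketch below is only an illustration under assumptions: the class `PropertyGraph`, its field names, and the sample data are hypothetical and are not the Sparksee API.

```python
from dataclasses import dataclass, field

@dataclass
class PropertyGraph:
    labels: dict = field(default_factory=dict)  # oid -> type of the node or edge
    ends: dict = field(default_factory=dict)    # edge oid -> (tail, head, directed)
    attrs: dict = field(default_factory=dict)   # (oid, attribute name) -> single value
    next_oid: int = 1

    def _add(self, label, attributes):
        oid = self.next_oid
        self.next_oid += 1
        self.labels[oid] = label
        for name, value in attributes.items():
            self.attrs[(oid, name)] = value
        return oid

    def add_node(self, label, **attributes):
        return self._add(label, attributes)

    def add_edge(self, label, tail, head, directed=True, **attributes):
        oid = self._add(label, attributes)
        self.ends[oid] = (tail, head, directed)
        return oid

    def edges_between(self, a, b):
        """Edges that can be followed from a to b."""
        return [e for e, (t, h, d) in self.ends.items()
                if (t, h) == (a, b) or (not d and (t, h) == (b, a))]

# Multigraph: the same author is related to the same department twice.
g = PropertyGraph()
author = g.add_node("author", name="A. Author")
dept = g.add_node("department", name="Some Department")
first = g.add_edge("related_to", author, dept, directed=False, since=2001)
second = g.add_edge("related_to", author, dept, directed=False, since=2010)
print(g.edges_between(author, dept))
```

Keeping edges as first-class objects with their own identifiers is what allows parallel edges and edge attributes to coexist.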

14 What do we want from graphs? What are the most common queries that we need to perform on graphs? What kind of processes must be carried out to perform these queries? Can we classify these queries? Is there a set of common underlying operations for all these queries?

15 Example: Usual Operations on Social Networks Common Social Network ops Neighbors Shortest Path Degree K-hops Connected Components Closeness Betweenness Bridging

18 Example: Usual Operations on Social Networks: Out-degree (figure: example graph with nodes annotated with their out-degree)

19 Example: Usual Operations on Social Networks: In-degree (figure: example graph with nodes annotated with their in-degree)

20 Example: Usual Operations on Social Networks: 1-hop (figure)

21 Example: Usual Operations on Social Networks: 2-hop (figure)

22 Example: Usual Operations on Social Networks: 3-hop (figure)

23 Example: Usual Operations on Social Networks: 4-hop (figure)
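The degree and k-hop operations illustrated in these slides can be sketched as follows. The graph, the function names, and the choice to return only the nodes first reached at exactly k hops are assumptions made for the example.

```python
def out_degree(out_adj, node):
    return len(out_adj.get(node, ()))

def in_degree(out_adj, node):
    # Count the nodes whose outgoing edges point at `node`.
    return sum(node in targets for targets in out_adj.values())

def k_hop(out_adj, source, k):
    """Nodes first reached in exactly k hops from source, following edge direction."""
    seen = {source}
    frontier = {source}
    for _ in range(k):
        frontier = {v for u in frontier for v in out_adj.get(u, ())} - seen
        seen |= frontier
    return frontier

# Hypothetical directed graph: node -> list of successors.
g = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
for k in range(1, 4):
    print(k, sorted(k_hop(g, 1, k)))
```

Each hop expands the previous frontier by one level, which is why k-hop queries touch a rapidly growing part of the graph as k increases.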

25 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

26 The graph processing stack I would like to determine who could be a good reviewer for a paper I need to compute, among others the difference between two social graphs I need to iterate over nodes, get neighbours through edge navigation, etc

27 The stack: tier 1, applications Example: Bibliography setting Who can be a good reviewer for a paper? Analyse paper submitted Find authorities for most important topics in the paper. Find social network of authors and eliminate them from the authorities.

28 The stack: tier 2, high level ops. High level operations. Analyse paper submitted: text analysis. Find authorities for most important topics in the paper: PageRank-like algorithm if citations are included in the graph. Find social network of authors and eliminate them from the authorities: two-hop algorithm (1st hop my papers, 2nd hop my coauthors), then graph subtraction.
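The last step of this slide, a two-hop traversal followed by a subtraction, can be sketched with plain sets. The bibliographic data, the function names, and the assumption that the authority ranking is already available as a list are all hypothetical; the text analysis and the PageRank-like step are not shown.

```python
def coauthor_network(papers_of, authors_of, author):
    """Two hops: author -> own papers -> authors of those papers."""
    network = {author}
    for paper in papers_of.get(author, ()):   # 1st hop: my papers
        network |= set(authors_of[paper])     # 2nd hop: my coauthors
    return network

def candidate_reviewers(authorities, submitting_authors, papers_of, authors_of):
    excluded = set()
    for author in submitting_authors:
        excluded |= coauthor_network(papers_of, authors_of, author)
    # Subtraction: keep the authority ranking, drop anyone in the excluded set.
    return [a for a in authorities if a not in excluded]

# Hypothetical bibliography.
papers_of = {"ann": ["p1"], "bob": ["p1", "p2"], "cat": ["p2"], "dan": ["p3"]}
authors_of = {"p1": ["ann", "bob"], "p2": ["bob", "cat"], "p3": ["dan"]}
ranking = ["bob", "dan", "cat"]               # assumed output of the authority step
print(candidate_reviewers(ranking, ["ann"], papers_of, authors_of))
```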

29 The stack: tier 3, low level ops. Basic analysis: Get node/edge Get attributes from a node or an edge Get neighbors Node degree Basic transformations Add/delete node/edge Add/delete/update attribute

30 Summary of graph operations D. Dominguez-Sal, N. Martinez-Bazan, V. Muntés-Mulero, P. Baleta, and J-Ll. Larriba-Pey. A discussion on the design of graph database benchmarks. In Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems (TPCTC'10). Springer-Verlag, Berlin, Heidelberg,

31 Initial Conclusions. Graphs must in general be attributed, labeled (typed), directed multigraphs. Operations on graphs: many operations require access to the whole graph; significant set of cascaded operations; many queries access both the structure and the attributes. Candidate scenarios to benefit from this technology: any scenario with large graph datasets, a variety of operations, and industrial interest.

33 Sparksee, high performance technology Norbert Martínez-Bazán, Arnau Prat Pérez, David Dominguez Sal, and Josep Lluis Larriba Pey et al. DAMA-UPC/Sparsity Technologies N. Martínez-Bazan et al: Dex: high-performance exploration on large graphs for information retrieval. CIKM 2007: N. Martínez-Bazan et al: Efficient graph management based on bitmap indices. IDEAS 2012: R. Angles, A. Prat-Pérez, D. Dominguez-Sal, J.-L. Larriba Pey: Benchmarking database systems for social network applications. GRADES 2013: 15

34 Classical Graph Representation Adjacency matrix Good: Static, good locality for dense graphs (not common), bit based representation Bad: Prohibitively expensive size, neighbors performance, add vertex

35 Classical Graph Representation: Adjacency list. Good: size for sparse graphs, neighbors performance. Bad: slower check of whether an edge exists between two vertices, dynamic memory.
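The trade-offs listed for the two classical representations show up directly in a small example. The graph and helper names below are hypothetical; the cost comments describe this plain-list sketch, not any particular engine.

```python
n = 4
edges = [(0, 1), (0, 2), (2, 3)]          # hypothetical directed edges

# Adjacency matrix: n * n cells regardless of how many edges exist.
matrix = [[0] * n for _ in range(n)]
# Adjacency list: one entry per edge.
adj_list = [[] for _ in range(n)]
for u, v in edges:
    matrix[u][v] = 1
    adj_list[u].append(v)

def has_edge_matrix(u, v):
    return matrix[u][v] == 1              # one cell lookup

def neighbors_matrix(u):
    return [v for v in range(n) if matrix[u][v]]   # scans a whole row of n cells

def has_edge_list(u, v):
    return v in adj_list[u]               # scans the neighbors of u

def neighbors_list(u):
    return adj_list[u]                    # already materialized

print(sum(len(row) for row in matrix), sum(len(row) for row in adj_list))
```

Adding a vertex to the matrix means growing every row, while the list only needs one more empty entry.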

36 Classical Graph Representation. Several limitations: no labels, no node and edge attributes, no multigraphs. Good for in-memory processing.

37 Relational database. Implementation: store each type of node in a table with its associated attributes; store each type of edge in a table with its associated attributes. Navigation through edges is resolved by join operations. Query language: SQL. Not suitable for path traversals and graph exploration. E.g.: DB2, Oracle, SQL Server, MySQL, Postgres.

38 Resource Description Framework (RDF). Triple format: Subject, Predicate, Object. All data is represented as triples; queries are patterns on such triples. Edge: a triple where subject and object are nodes. Attribute: a triple where the subject is a node and the object is a value. Query language: SPARQL. Not a natural representation. E.g.: Virtuoso, RDF-3X, Sesame; RDBMSs are also starting to support RDF (DB2, Oracle).
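A toy triple store makes the edge/attribute distinction and the pattern-based querying concrete. This is a sketch under assumptions: the tuples reuse node names from the Wikipedia example later in the talk, predicates are plain strings rather than IRIs, and `match` is a hypothetical helper, not SPARQL.

```python
triples = [
    ("v4", "ref", "v3"),            # edge: the object is another node
    ("v3", "title", "Europe"),      # attribute: the object is a literal value
    ("v4", "title", "Barcelona"),
    ("v3", "nlc", "en"),
]

def match(pattern, store):
    """Yield the triples that fit a pattern; None acts as a wildcard."""
    for triple in store:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple

# Titles of the nodes referenced by v4: one pattern per hop, joined by hand.
titles = [
    title
    for (_, _, node) in match(("v4", "ref", None), triples)
    for (_, _, title) in match((node, "title", None), triples)
]
print(titles)
```

Every traversal step becomes another pattern to join, which is one reason the slide calls the representation not natural for graph navigation.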

39 New approaches. Graph databases: graph / edge types; specialized graph API or language (e.g. Cypher, Gremlin); out-of-core functionalities with buffer pool. E.g. Sparksee, Neo4j, OrientDB, HyperGraphDB, InfiniteGraph. Distributed graph analysis for large datasets: Map-reduce (Pegasus); vertex-centric computation model (Pregel, Giraph, GraphLab).

40 Requirements Data and schema represented as a graph Data operations based on graph operations Graph-based integrity restrictions Multigraphs Attributes attached to both vertices and edges Graph queries combining edge traversals with attribute accesses Diversity of workloads Efficient secondary memory management

41 Sparksee Main features Graph split into small structures Move to main memory just significant parts (caching) Object identifiers (oids) instead of complex objects Reduce memory requirements Specific structures to improve traversals Index the edges and the neighbors of each node Attribute indices Improve queries based on value filters Implemented in C++ Different APIs (Java,.NET, etc.) through wrappers

42 Sparksee Capabilities. Efficiency: very compact representation using bitmaps; highly compressible data structures. Capacity: more than 100 billion vertices and edges in a single multicore computer. Performance: subsecond response in recommendation queries. Scalability: high throughput for concurrent queries. Consistency: partial transactional support with recovery. Multiplatform: Linux, Windows, MacOSX, mobile.

43 Sparksee Architecture (figure; one labeled component: SparkseePython)

44 Graph representation. We define a graph $G = (V, E, L, T, H, A_1, \ldots, A_p)$ as: LABELS $L = \{(o, l) \mid o \in (V \cup E) \wedge l \in \text{string}\}$; TAILS $T = \{(e, t) \mid e \in E \wedge t \in V\}$; HEADS $H = \{(e, h) \mid e \in E \wedge h \in V\}$; ATTRIBUTES $A_i = \{(o, c) \mid o \in (V \cup E) \wedge c \in \{\text{int}, \text{string}, \ldots\}\}$. With this representation: the graph is split into multiple lists of pairs; the first element of each pair is always a vertex or an edge.

45 Graph representation L (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS),(e6, CONTAINS), (e7, CONTAINS) T (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4) H (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6) A_id (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2) A_title (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona) A_nlc (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en) A_filename (v5, europe.png), (v6, bcn.jpg) A_tag (e4, continent)

46 Value sets. Groups all pairs of the original set with the same value as a pair between the value and the set of objects with such value.
L: (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS), (e6, CONTAINS), (e7, CONTAINS) -> (ARTICLE, {v1, v2, v3, v4}), (BABEL, {e1, e2}), (CONTAINS, {e5, e6, e7}), (IMAGE, {v5, v6}), (REF, {e3, e4})
T: (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4) -> (v1, {e1}), (v2, {e2}), (v3, {e5, e6}), (v4, {e3, e4, e7})
H: (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6) -> (v3, {e1, e2, e3, e4}), (v5, {e5}), (v6, {e6, e7})
A_id: (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2) -> (1, {v1, v5}), (2, {v2, v6}), (3, {v3}), (4, {v4})
A_title: (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona) -> (Barcelona, {v4}), (Europa, {v1}), (Europe, {v2, v3})
A_nlc: (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en) -> (ca, {v1}), (en, {v3, v4, e1, e2}), (fr, {v2})
A_filename: (v5, europe.png), (v6, bcn.jpg) -> (bcn.jpg, {v6}), (europe.png, {v5})
A_tag: (e4, continent) -> (continent, {e4})
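The grouping described on this slide is a plain "invert and group by value" step. A minimal sketch, using the T and A_title pairs from the example; the function name `value_set` is an assumption.

```python
from collections import defaultdict

def value_set(pairs):
    """Group (object, value) pairs into value -> set of objects with that value."""
    groups = defaultdict(set)
    for obj, value in pairs:
        groups[value].add(obj)
    return dict(groups)

T = [("e1", "v1"), ("e2", "v2"), ("e3", "v4"), ("e4", "v4"),
     ("e5", "v3"), ("e6", "v3"), ("e7", "v4")]
A_title = [("v1", "Europa"), ("v2", "Europe"), ("v3", "Europe"), ("v4", "Barcelona")]

print(value_set(T))
print(value_set(A_title))
```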

47 Bitmap Index. Each vertex and edge is identified by a unique and immutable oid (object identifier). Each vertex or edge set is stored in a bitmap structure: each position in a bitmap corresponds to the oid of an object; reduced amount of space (compression techniques); very efficient binary logic operations. E.g.: (ARTICLE, {v1, v2, v3, v4}) -> Article:
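To illustrate the idea, a Python integer can stand in for an uncompressed bitmap, with bit i set when the object with oid i belongs to the set. The oid assignment below (v1..v6 as 1..6, e1..e7 as 7..13) is an assumption made for the example, not the numbering used by the engine.

```python
def to_bitmap(oids):
    bitmap = 0
    for oid in oids:
        bitmap |= 1 << oid          # set the bit at the position of the oid
    return bitmap

def from_bitmap(bitmap):
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Assumed oids: v1..v6 -> 1..6, e1..e7 -> 7..13.
ARTICLE = to_bitmap([1, 2, 3, 4])       # (ARTICLE, {v1, v2, v3, v4})
EN = to_bitmap([3, 4, 7, 8])            # (en, {v3, v4, e1, e2})

print(bin(ARTICLE))
print(from_bitmap(ARTICLE & EN))        # intersection is a single AND
print(bin(ARTICLE | EN).count("1"))     # size of the union
```

Set intersection, union and difference reduce to bitwise AND, OR and AND-NOT, which is where the "very efficient binary logic operations" come from.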

48 Link: the basic internal structure. New way to represent a graph that allows for out-of-core management: the graph is split into smaller structures to favor the caching of significant parts; object identifiers (oids) are used to reduce memory requirements; specific structures help in the navigation and traversal of edges; attributes are fully indexed to allow queries based on filters. (figure: a Link combines Maps between oids and values with Bitmaps)

49 Example of a bitmap based representation

50 Integrity rules

51 Value set operations. Domain: returns the set of distinct values. Objects: returns the set of vertices or edges associated to a value. Lookup: returns the set of values associated to a set of objects. Insert/Remove: adds a vertex or edge to the collection, or removes it.

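52 Graph query examples. Number of articles: $|\text{objects}(\text{LABELS}, \text{ARTICLE})|$. Out-degree of the English article Europe: $|\text{objects}(\text{TAILS}, \text{objects}(\text{TITLE}, \text{Europe}) \cap \text{objects}(\text{NLC}, \text{en}) \cap \text{objects}(\text{LABELS}, \text{ARTICLE}))|$. Articles with references to the image with filename bcn.jpg: $\{\text{lookup}(\text{TAILS}, x) \mid x \in \text{objects}(\text{HEADS}, \text{objects}(\text{FILENAME}, \text{bcn.jpg}) \cap \text{objects}(\text{LABELS}, \text{IMAGE}))\}$. Count the articles of each language: $\{(x, y) \mid x \in \text{domain}(\text{NLC}) \wedge y = |\text{objects}(\text{NLC}, x) \cap \text{objects}(\text{LABELS}, \text{ARTICLE})|\}$.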
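The four example queries can be run against the value sets of the Wikipedia example with ordinary Python sets. This is a sketch under assumptions: `domain`, `objects` and `lookup` are reimplemented here over dictionaries, and `objects` is allowed to take either one value or a set of values, which is how the nested calls on the slide are read.

```python
LABELS = {"ARTICLE": {"v1", "v2", "v3", "v4"}, "IMAGE": {"v5", "v6"},
          "BABEL": {"e1", "e2"}, "REF": {"e3", "e4"}, "CONTAINS": {"e5", "e6", "e7"}}
TAILS = {"v1": {"e1"}, "v2": {"e2"}, "v3": {"e5", "e6"}, "v4": {"e3", "e4", "e7"}}
HEADS = {"v3": {"e1", "e2", "e3", "e4"}, "v5": {"e5"}, "v6": {"e6", "e7"}}
TITLE = {"Barcelona": {"v4"}, "Europa": {"v1"}, "Europe": {"v2", "v3"}}
NLC = {"ca": {"v1"}, "en": {"v3", "v4", "e1", "e2"}, "fr": {"v2"}}
FILENAME = {"bcn.jpg": {"v6"}, "europe.png": {"v5"}}

def domain(vs):
    return set(vs)

def objects(vs, values):
    """Objects associated with one value, or with any value of a set of values."""
    if not isinstance(values, (set, frozenset)):
        values = {values}
    result = set()
    for value in values:
        result |= vs.get(value, set())
    return result

def lookup(vs, objs):
    """Values associated with at least one object of the given set."""
    return {value for value, members in vs.items() if members & objs}

articles = objects(LABELS, "ARTICLE")
n_articles = len(articles)
english_europe = objects(TITLE, "Europe") & objects(NLC, "en") & articles
out_degree = len(objects(TAILS, english_europe))
bcn_image = objects(FILENAME, "bcn.jpg") & objects(LABELS, "IMAGE")
referencing = lookup(TAILS, objects(HEADS, bcn_image))
per_language = {x: len(objects(NLC, x) & articles) for x in domain(NLC)}
print(n_articles, out_degree, sorted(referencing), per_language)
```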

53 Implementation details. Bitmaps are compressed by grouping the bits into clusters of 32 consecutive bits (up to 137 billion objects per graph). Locality is improved by generating consecutive oids for each distinct vertex or edge label. A sorted tree structure of bitmap clusters speeds up the insert, remove, and binary logic operations. Maps are implemented using B+ trees. The tail, head and attribute value sets have been split into specific value sets for each label.
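The clustering idea can be sketched as a sparse map from cluster index to a 32-bit word, keeping only clusters with at least one bit set (as the additional material on DEX internals describes). This is an illustration under assumptions: a Python dict replaces the sorted tree of clusters, and all names are hypothetical. With the 37-bit identifiers mentioned in the additional material, the capacity is $2^{37} = 137{,}438{,}953{,}472$ objects, the "137 billion" on the slide.

```python
CLUSTER_BITS = 32

def compress(oids):
    """Map cluster index -> 32-bit word; clusters with no bit set are not stored."""
    clusters = {}
    for oid in oids:
        index, offset = divmod(oid, CLUSTER_BITS)
        clusters[index] = clusters.get(index, 0) | (1 << offset)
    return clusters

def contains(clusters, oid):
    index, offset = divmod(oid, CLUSTER_BITS)
    return bool(clusters.get(index, 0) >> offset & 1)

def intersect(a, b):
    """Bitwise AND, visiting only the clusters present in both bitmaps."""
    result = {}
    for index in a.keys() & b.keys():
        word = a[index] & b[index]
        if word:
            result[index] = word
    return result

sparse = compress([1, 33, 1_000_000])
print(len(sparse), 2 ** 37)      # 3 stored clusters for a range of a million oids
```

Because consecutive oids are generated per label, the bits of one type tend to fall into few clusters, which keeps the stored map small.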

54 Evaluation (GRADES 13). Objective: understand how small queries stress different DBs: Graph, RDF, Relational. Data schema based on social network data: Person (ID, name, location, age), WebPage (ID, URL, creation), related by Friend and Like. Graph data generator based on R-MAT. In collaboration with R. Angles, U. Talca.

55 Query set Selection (Q1) Get all the persons having a name N (Q4) Get the name of the person with a given PID Adjacency (Q2) Get all the persons who like a given webpage W (Q3) Get the webpages that person P likes Pattern matching (Q10) Get the common friends between persons P1 and P2 (Q11) Get the common webpages that persons P1 and P2 like Fixed-length path (Q5) Get the friends of the friends of a given person P (Q6) Get the webpages liked by the friends of a given person P (Q7) Get persons that like a webpage which a person P likes Reachability (Q8) Is there a friend connection between two persons? (Q9) Get the shortest path between two persons Summarization (Q12) Get the number of friends of a person P
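A few of these queries can be written directly over adjacency sets to show what each category asks of a system. The data and function names are hypothetical, and one reading is assumed for Q5: the person P is removed from the result, but direct friends are kept.

```python
# Hypothetical data following the Person / WebPage schema of the evaluation.
friends = {"p1": {"p2", "p3"}, "p2": {"p1", "p3", "p4"},
           "p3": {"p1", "p2"}, "p4": {"p2"}}
likes = {"p1": {"w1"}, "p2": {"w1", "w2"}, "p3": {"w2"}, "p4": set()}

def q5_friends_of_friends(p):
    # Fixed-length path: two Friend hops starting at p.
    return set().union(*(friends[f] for f in friends[p])) - {p}

def q6_pages_liked_by_friends(p):
    # Fixed-length path: one Friend hop, then one Like hop.
    return set().union(*(likes[f] for f in friends[p]))

def q10_common_friends(p1, p2):
    # Pattern matching: persons adjacent to both p1 and p2.
    return friends[p1] & friends[p2]

def q12_number_of_friends(p):
    # Summarization: degree of p over Friend edges.
    return len(friends[p])

print(q5_friends_of_friends("p1"), q10_common_friends("p1", "p2"))
```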

56 Experiments: systems
DB System | DB Type | Implementation | Query language
Dex (v4.7) | Graph | API | -
Neo4j (v1.8.2) | Graph | API | Cypher
RDF-3X (v0.3.7) | RDF | Java driver | SPARQL
Virtuoso (v7.0) | RDF / Column store | Java driver, stored procedures | SQL + Extension, Virtuoso/PL
PostgreSQL (v9.1) | Row based | Java driver, stored procedures | SQL, PL/PgSQL
Test environment. Hardware: Intel Xeon E GHz, 32 GB RAM, 1TB HD. Software: Linux Debian amd64 kernel, ext3 file system.

57 Experiments: query execution test

59 On Demand Memory Specialization for Distributed Graph Databases. Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep-Lluis Larriba-Pey. Reference: Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep-Lluis Larriba-Pey: On Demand Memory Specialization for Distributed Graph Databases. CoRR abs/ (2013)

60 Motivation Large Graphs: Need to be stored and processed in parallel/distributed computers Difficult to partition Partition functions Favour specific operations Have to be executed for different environments Are costly to execute

61 Contributions System design in two levels Physical storage Memory management Data access pattern monitoring Specific data structure Load and network balancing Increased throughput

62 System Overview Memory management Storage

63 Partition Manager We propose a new data structure Monitors data access patterns Uses this information in a simple way to decide how to route queries Matrix of data access sequences New compressed data structure

65 Experiments Scalability with cluster size Tested up to 32 machines Systems compared Static partitioning Dynamic partitioning (ours) R-MAT graph 37M vertices 1B edges Queries: BFS and k-hops

66 Experiments: BFS. Throughput (higher is better). Load imbalance (lower is better).

67 Experiments

68 Experiments: Average response time (lower is better)

70 Shaping Communities out of Triangles Arnau Prat Pérez, David Dominguez Sal, Josep Maria Brunat and Josep Lluis Larriba Pey DAMA-UPC A. Prat-Pérez, D. Dominguez-Sal, J.L. Larriba-Pey: High quality, scalable and parallel community detection for large real graphs. WWW 2014: A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, J.L. Larriba-Pey: Shaping communities out of triangles. CIKM 2012:

71 Community detection Community Informal definition set of nodes well connected among them, but not with other nodes No agreement on a definition

72 State of the art: Community Detection. In general, state of the art metrics perform well: modularity, conductance. However, under certain circumstances, they fail: treelike structures, clique chains. Algorithms based on those metrics fail to provide quality and performance. Reason: they focus on edges but ignore the internal structure.

73 Our goal. Find algorithms with a strong focus on: quality, scalability, parallelism. First, propose a metric: Weighted Community Clustering (WCC). Then, propose an algorithm: Scalable Community Detection (SCD). First proposal for a scalable disjoint community detection algorithm for SMP architectures. Undirected graph without attributes.

74 Weighted Community Clustering We take triangles as the key factor of a community structure.

75 Weighted Community Clustering. t(x,S) is the number of triangles that vertex x closes with vertices in a set S. vt(x,S) is the number of vertices of a set S that form at least one triangle with x. The WCC of a set S is the average WCC of its vertices.
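The two counts defined on this slide can be computed with set operations on adjacency sets. This sketch makes an assumption that the slide does not settle: a triangle counts only when both of its other vertices lie in S. The full WCC formula is not given in the slides and is not reproduced here; the graph and function names are hypothetical.

```python
from itertools import combinations

def t(adj, x, S):
    """Number of triangles that x closes with two vertices of S."""
    members = (adj[x] & S) - {x}
    return sum(1 for u, v in combinations(members, 2) if v in adj[u])

def vt(adj, x, S):
    """Number of vertices of S that form at least one triangle with x (inside S)."""
    members = (adj[x] & S) - {x}
    return sum(1 for u in members if adj[u] & members)

# Hypothetical undirected graph: a triangle a-b-c with a pendant vertex d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
V = set(adj)
S = {"a", "b", "c"}
print(t(adj, "a", S), vt(adj, "a", S), t(adj, "d", V), vt(adj, "c", V))
```

The pendant vertex d closes no triangle, so it contributes nothing to either count, which matches the slide's emphasis on triangles rather than edges.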

76 Properties Maximizing WCC, fulfils minimum properties: Internal Structure, triangles. Linear Community Cohesion, number of connections of a node Bridges, no bridges secured. Cut Vertex Density, minimum density secured.

77 Algorithm overview

78 Experimental setup. Real graphs with ground truth communities. Average F1Score and NMI to measure quality. Baseline with disjoint and overlapping algorithms: Walktrap, Infomap, Louvain, Bigclam and Oslom. Intel Xeon Quad Core 2.4 GHz with 32 GB of RAM.
Graph | Nodes | Edges
Amazon | 0.3 M | 0.9 M
dblp | 0.3 M | 1.0 M
Youtube | 1.1 M | 2.9 M
Livejournal | 3.9 M | 34.6 M
Orkut | 3 M | M
Friendster | 65 M | 1806 M

79 Execution Time All executions are single threaded. Executions longer than one week were aborted.

80 F1Score

81 Complexity Friendster graph in 4.3 hours! Intel Xeon Quad Core 2.4 Ghz with 32 GB of RAM.

82 Parallelism: Parallelization with OpenMP.

84 Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey: Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval. ICMR 2014: 33

85 Introduction. Quality depends on the user's skills. Short keyword-based queries. Not an exact, correct and complete set of keywords.

86 Query Expansion. Process of transforming $Q_O$ into $Q_E$. Detect expansion features (terms/phrases). What kind of expansion features? How to obtain the expansion features?

87 Overview

88 Path detection. θ = colored Volkswagen beetles. κ = volkswagen beetles in any color, for example red, blue, green or yellow. (figure: paths detected around volkswagen beetle, involving the nodes Volkswagen, volkswagen fox, volkswagen passat, volkswagen type 2, volkswagen golf, volkswagen jetta, volkswagen touareg, volkswagen golf mk4 and volkswagen transporter)

89 Build communities Build communities around paths Use paths as seeds Communities retrieve the closest related concepts to the path May not be directly connected to the concept

90 Example. Query: Colored volkswagen beetles. From topological expansion: Volkswagen beetle, Volkswagen fusca, VW type 1, Volkswagen 1200, Volkswagen bug, Volkswagen super bug, VW bug, VW Käfer, Volkswagen new Beetle, Baja bug, Volkswagen group, VW group, H-Shifter, engine, car, automobile, etc. From redirect-based expansion: VW Beetles, VW Beetle, VW Käfer, Volkswagon Beetles, Volkswagon Beetle, Volkswagon Käfer, Volkswagen Beetles, Volkswagen Beetle, Volkswagen Käfer. Hundreds of expansion terms.

91 Results for the query "Colored volkswagen beetles" (figure: results with enrichment vs. no enrichment)

93 LDBC Social Network Benchmark

94 Why benchmarking Two main objectives: Allow final users to assess the performance of the software they want to buy Push the technology to the limit to allow for progress Main effort in DB benchmarking up to now TPC: Transaction Processing Performance Council Relational DBs: Transactional and DSS

95 What is LDBC? Non-profit organization formed in 2012 from a European Union project. Defines and develops benchmarks for graph-like data management technologies: Graph Database Management Systems (GDBMS), RDF systems, graph processing frameworks. LDBC provides: all the documentation (workload definitions, running instructions, disclosure guidelines) of the benchmarks it develops; all the necessary data and software to run the benchmarks.

96 What is LDBC? The main actors in graph data management and RDF participate:

97 Objectives: Benchmarks for the emerging field of RDF and Graph database management systems (GDBs) Spur industry cooperation around benchmarks LDBC foundation created and operative on 2Q Benchmarks created: Semantic Publishing Benchmark (SPB) Social Network Benchmark (SNB) Web site: Software repositories in Github

98 What is LDBC? Currently developing two benchmarks. Social Network Benchmark (LDBC-SNB): for testing the performance of graph databases and graph processing frameworks, inspired by the management of a social network. Semantic Publishing Benchmark (LDBC-SPB): for testing the performance of RDF engines, inspired by the media/publishing industry. For more information, visit:

99 The Social Network Benchmark LDBC-SNB[1] Philosophy: Rich coverage Modularity Small implementation cost Relevance Reproducibility Open source, you may do whatever you want with it!!! Mimics the operation of a real social network: Simple to understand Allows the testing of a complete range of interesting challenges Can be easily scaled [1]

100 The SNB workloads. Interactive: interactive queries representing the interaction of the users with the social network; low latency and multiple concurrent users; small data accessed; 14 read and 8 update queries. Business Intelligence: complex structured queries for analyzing the online behavior of users for marketing purposes; non-interactive queries; moderate data accessed; 22 queries. Graph Analytics: expensive graph analytical queries (PageRank, centrality, clustering); large data accessed; not defined yet, being designed.

101 LDBC-SNB

102 DATA SCHEMA

103 INTERACTIVE. Consists of 14 read queries and 8 update queries. Read queries with their high level description: Read Query 1: Friends with a certain name. Read Query 2: Recent posts and comments by your Friends. Read Query 3: Friends and Friends of Friends that have been in countries X and Y. Read Query 4: New topics. Read Query 5: New groups. Read Query 6: Tag co-occurrence. Read Query 7: Recent likes.

104 INTERACTIVE. Read queries with their high level description: Read Query 8: Recent replies. Read Query 9: Recent posts and comments by Friends and Friends of your Friends. Read Query 10: Friend recommendation. Read Query 11: Job referral. Read Query 12: Expert search. Read Query 13: Single shortest path. Read Query 14: Weighted paths.

105 INTERACTIVE. Update queries with their high level description: Update Query 1: Add Person. Update Query 2: Add Friendship. Update Query 3: Add Forum. Update Query 4: Add Forum Membership. Update Query 5: Add Post. Update Query 6: Add Like to Post. Update Query 7: Add Comment. Update Query 8: Add Like to Comment.

106 BUSINESS INTELLIGENCE. Consists of 22 read queries. Queries with their high level description: Query 1: Post stats. Query 2: Volume in forums on a subject in a topic. Query 3: The locally going thing. Query 4: Thread length distribution. Query 5: Best publicist. Query 6: Branding hour. Query 7: Market share. Query 8: Cross-border conversation.

107 BUSINESS INTELLIGENCE. Queries with their high level description: Query 9: Person with most posts in foreign language. Query 10: People who studied in the same university. Query 11: Teenagers talking to strangers. Query 12: Liker clique. Query 13: Data quality. Query 14: Find unusual values. Query 15: Development of mindshare. Query 16: Biggest posters on a tag.

108 BUSINESS INTELLIGENCE. Queries with their high level description: Query 17: Country bias. Query 18: Posting activity. Query 19: Hub countries. Query 20: Moving predicates. Query 21: Mole hunt. Query 22: Concept promoter.

109 DATAGEN. Datasets are synthetically generated using DATAGEN[1]: realistic, scalable, deterministic, usable. Datasets simulate a social network's activity during a period of time. Uses Hadoop (MapReduce) to scale to millions of entities. [1]

110 DATAGEN. 16 dictionaries extracted from DBpedia are used to produce correlated attributes, e.g. names by country, tags by country, companies by country, etc. Reproduces realistic distributions found in real social networks, e.g. the friendship degree distribution mimics that found in Facebook. Reproduces the homophily principle, i.e. similar people tend to be connected. Generates substitution parameters for the queries of the workloads. Implemented with Hadoop to generate huge datasets.

111 SCALE FACTORS We define a set of scale factors: SF1, SF3, SF10, SF1000 Different scale factors target systems of different sizes and characteristics From single node machines to large clusters. Each scale factor represents a data set of a different size, in gigabytes: SF Persons Activity Size SF years 1GB SF years 10GB SF years 100GB

112 DATAGEN degree distribution (figure: degree distribution of Facebook [1] compared with Datagen SF1) [1] -data-team/anatomy-of-

113 LDBC Execution Driver. We provide the SNB execution driver[1]: easy to extend; small implementation cost; multiple concurrent threads. Each query type is assigned an interleave interval: the time between the issue of two query instances of the same type. Queries vary in complexity, so we must ensure that no query type dominates the execution time. Interleave intervals are extracted experimentally. Tracks dependencies between update queries automatically: e.g. we cannot insert a post before its author is inserted. Automatically gathers and reports performance metrics. [1]
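The interleave idea can be sketched as merging one periodic stream per query type into a single time-ordered schedule. This is not the LDBC driver: the function name, the query type names, and the interval values are hypothetical, and dependency tracking between updates is not modeled.

```python
import heapq

def schedule(interleave_ms, horizon_ms):
    """Merge per-type issue times into one time-ordered list of (time, query type)."""
    heap = [(interval, name) for name, interval in interleave_ms.items()]
    heapq.heapify(heap)
    timeline = []
    while heap and heap[0][0] <= horizon_ms:
        due, name = heapq.heappop(heap)
        timeline.append((due, name))
        # The next instance of the same type is due one interleave interval later.
        heapq.heappush(heap, (due + interleave_ms[name], name))
    return timeline

# Hypothetical intervals: a cheap query issued often, an expensive one rarely.
plan = schedule({"Q1": 30, "Q2": 50}, horizon_ms=100)
print(plan)
```

Giving expensive query types a longer interval is one way to keep any single type from dominating total execution time, which is the goal the slide states.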

114 Conclusions. LDBC defines and develops benchmarks for graph-like data management: relevant, easy to adopt. Two benchmarks: SNB, with three different workloads for different types of graph management systems; SPB, for RDF engines. LDBC-SNB provides: a realistic social network data generator; an execution driver. All software is open source.

116 Conclusions. Graph management is becoming important: different applications need traversals, pattern matching and complex graph operations; performance is an issue. Building high performance systems: Sparksee, a vertical store based on bitmap processing; distributed management of graphs. Community search: it is important to take structure into account. LDBC provides: benchmarks relevant for industry; two flavors of benchmarks, SPB and SNB; open to everybody to use, explore, and propose new ideas in benchmarking.

117 Any questions? Contact DAMA Group Web Site:

118 Sparksee evaluation (IDEAS) Additional material

119 Commercial Graph DBMSs

120 DEX: Some internal details. All the structures presented have been implemented in the current version of the DEX core engine. Support for 37-bit unsigned integer nids and eids: more than 137 billion objects per graph. Identifiers are clustered in groups for each node or edge type. Bitmaps are compressed by grouping the bits in clusters of 32 consecutive bits; only clusters with at least one bit set are stored. Maps are implemented using B+ trees. References: N. Martínez-Bazán, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M. Sánchez-Martínez, and J.Ll. Larriba-Pey. DEX: High Performance Exploration on Large Graphs for Information Retrieval. In Proceedings of the 16th ACM Conference on Information and Knowledge Management Conference (CIKM), Lisbon, pages , N. Martínez-Bazan, S. Gómez-Villamor, V. Muntés-Mulero and J. Ll. Larriba-Pey. Procedure to represent and manipulate multigraphs based on the use of bitmaps. Spanish Patent and Trademark Office, Patent Pending #40444, 22/July/2008. N. Martínez-Bazán. DEX: High-Performance Graph Databases. Master Thesis in Computer Architecture and Network Systems, Universitat Politècnica de Catalunya, Barcelona, September/2008.

121 Performance and Memory Usage. Benchmark: Wikipedia from 254 different languages with 57 million articles, 2.1 million images, more than 321 million links, and 483 million attribute values (17 GB). Table columns: Query, Time (s), Traversals, Bitmaps (MB), Mem. usage (MB) (values not recoverable from the transcription). Queries: Q1 large SPT (BFS): find the article with the largest outdegree and find a shortest-path tree (SPT) only considering the edges of type REF. Q2 top-k 2-hop: recommend related articles to the most popular one. Q3 pattern matching: find new images for articles in other languages. Q4 group by & count: find, for each different language, the number of articles and the number of images referenced by those articles without repetition. Q5 update 2.1M: for each article, materialize an attribute indicating the number of images contained, only if it contains more than one image. Q6 delete ~90%: remove all the articles without any image. Experiments performed using a computer with two quad-core Intel(R) Xeon(R) E5440 at 2.83 GHz. The memory hierarchy is 6144 KB second level cache, 64 GB main memory, and a disk with 1.7 TB. The operating system is Linux Debian etch 4.0.

122 Analysis of the Distribution of Bitmaps 99.97% of the bitmaps are smaller than 250 bytes and occupy only 32% of data

123 Analysis of Bitmap Usage

124 Comparison with Other Approaches In-memory MonetDb Neo4J DEX Data (GB) Load time (hours) Q1 large SPT (BFS) Q2 top-k 2-hop Q3 pattern matching Q4 group by & count Q5 update 2.1M Q6 delete ~90% > 1 week

125 Scalability

126 Scalability

127 SPB Benchmark - Sparksee on RDF Additional material

128 Schema

129 Choke points 1. CP1 Join ordering 2. CP2 Aggregation 3. CP3 Optional and nested optional clauses 4. CP4 Reasoning 5. CP5 Parallel execution of unions 6. CP6 Optional with filters 7. CP7 Ordering 8. CP8 Geo-spatial predicates 9. CP9 Full text search 10. CP10 Duplicate elimination 11. CP11 Complex filter conditions

130 Queries (I)

131 Queries (II)

132 Characteristics of SPB queries

133 Results