Graph Databases Josep Lluis Larriba Pey David Dominguez Sal

1 Graph Databases Josep Lluis Larriba Pey David Dominguez Sal Norbert Martinez Bazán

2 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Graph technology and management: Sparksee 5. Basic operations: Community search 6. Parallel and distributed graph management 7. Applications: enriching queries 8. Benchmarking: LDBC 9. Conclusions

3 DAMA-UPC: university research group Our goal: Research and Technology Transfer on Managing Very Large Data Volumes. The beginnings: since 1999, founded inside UPC. The team: 14 researchers and developers. Our software: DEX, massive graph management. Awards: IBM Faculty Awards (2004, 2009), IBM PhD Award (2004), CINC prize for novel entrepreneurs (2009). Funding, agreements and collaborators. Publications: > 70 research papers, 4 patents.

4 The DAMA-UPC/Sparsity ecosystem
Research: Graph Mgmt, Community detection, Parallel graph Mgmt, Keyword enrichment. Help Apps & Technology Development. European projects: LDBC, CoherentPaaS, Tetracom.
Technology Development: Graph Mgmt (DEX, original GDB). Collaboration with Sparsity: Support for research, Support to Apps, Technology Transfer.
Applications: Social analytics (Tweeticer), Social Media, New mobile Apps.
Why Sparsity? Good vehicle to: put our technologies in the market; create synergies with other companies; allow flexibility with DAMA-UPC (EC projects, hiring of talent).
What does Sparsity commercialise? Sparksee 5.1 (Linux, Windows, MacOS) (C++, Java, Python, .Net); Sparksee 5.1 mobile (Android, iOS, BB10); Tweeticer; Daurum.
How is Technology Transfer organised? Gather the needs from Companies; Research; Marketing and Sales; Customer support. Collaborations with: IBM, Oracle, CA Technologies, Media Planning, BMAT, etc.
Put our technology in the market

6 Graphs are everywhere!

7 Graphs are everywhere! Interest in the structural analysis of entity relationships is growing dramatically in many scenarios. The number of applications calling for efficient large graph management is increasing every day. Handling and querying such large graphs efficiently becomes essential. Graph Mining techniques are emerging for analyzing graph-like data and extracting information.

8 Motivating Queries Bibliographic Exploration. Authority discovery: Who receives more cites from reputed authors? Who is more central to a topic? Social network of an author. Determination of the life of a topic: How did a topic evolve backward and forward (citations)? Reviewer recommendation: Who knows a topic better? Who is authoritative? Who did some authors collaborate with?

9 Motivating Queries Social Network Analysis. Determination of an actor's centrality in a network: Degree centrality, Closeness centrality ((normalized) inverse of the average distance metric), Betweenness centrality. Determination of the role of an actor in a network: actors playing a particular social role have to be equivalent/similar to each other; structural equivalence (based on neighborhood). Differences with Graph Mining: graphs are usually smaller; power laws are not observed; efficient algorithms are not an issue.
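The two simplest centrality measures above can be sketched as follows. This is a minimal illustration for an unweighted, undirected graph; the toy path graph and the function names are assumptions made for the example and do not come from the slides.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj, v):
    """Degree of v divided by the maximum possible degree (n - 1)."""
    return len(adj[v]) / (len(adj) - 1)


def closeness_centrality(adj, v):
    """Normalized inverse of the average distance from v to reachable vertices."""
    dist = bfs_distances(adj, v)
    total = sum(dist.values())
    if total == 0:  # isolated vertex: no other vertex is reachable
        return 0.0
    return (len(dist) - 1) / total


# Hypothetical toy network: a path a - b - c - d
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
closeness_b = closeness_centrality(adj, "b")  # distances 1, 1, 2 -> 3 / 4
closeness_a = closeness_centrality(adj, "a")  # distances 1, 2, 3 -> 3 / 6
```

Inner vertices of the path score higher than the end points, which matches the intuition that they are "closer" to everyone else.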

10 Motivating queries (cont'd) Root cause analysis. Distributed computing environment. Determination of the cause of errors. Pattern discovery. Spread of errors. Prediction of malfunctioning. Alerts of possible malfunctioning.

11 Small, large and huge graphs
Tweets: 177M tweets/day (04/11) x 365 > 64B tweets/year; 500M tweets/day (10/12) x 365 > 180B tweets/year
Papers: 20K average papers/day x 365 > 7M/year (Scopus)
Internet users: from 1995 to 2010 to 2013, a growth of more than 100 times in the number of internet users, from 16M to > 2B to > 2.8B; 29% of the world's population in 2010 and 39% in 2013
B users send >400B s/day
3B phone calls/day in the USA as

12 Graphs are everywhere! Two models for graph processing Transactional like environments, where queries are oriented to who is.., social network of Restrictions on the data google like queries, person oriented queries. Problems. Privacy, real time aggregation of answers from diff. sources Graph analytical processing environments, where queries are oriented to provide big structural answers Queries on the whole graph like what communities are there in the graph. Problems. How to distribute the graph, parallel implementations

13 What do we need to represent? Labeled: nodes and edges are typed. Nodes authors, papers and keywords. Edges relations and citations. Directed: edges can have a fixed direction. Relations bidirectional, Citations unidirectional. Attributed: nodes and edges can have multiple singlevalued attributes. Nodes authors have profile. Edges citations have a date. Multigraph: two nodes can be connected by multiple edges An author may be related to a department in different moments
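The four properties listed above (labeled, directed, attributed, multigraph) can be captured in a very small data structure. The sketch below is only an illustration: the class name `PropertyGraph`, its methods, and the sample author/department data are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class PropertyGraph:
    labels: dict = field(default_factory=dict)  # oid -> type label (nodes and edges)
    tails: dict = field(default_factory=dict)   # edge oid -> source node oid
    heads: dict = field(default_factory=dict)   # edge oid -> target node oid
    attrs: dict = field(default_factory=dict)   # oid -> {attribute name: single value}
    _ids: count = field(default_factory=count, repr=False)

    def add_node(self, label, **attributes):
        oid = next(self._ids)
        self.labels[oid] = label
        self.attrs[oid] = dict(attributes)
        return oid

    def add_edge(self, label, tail, head, **attributes):
        # Edges get their own oid, so parallel edges are distinct objects.
        oid = self.add_node(label, **attributes)
        self.tails[oid] = tail
        self.heads[oid] = head
        return oid

    def edges_between(self, tail, head):
        """All edges going from tail to head (direction matters)."""
        return [e for e in self.tails
                if self.tails[e] == tail and self.heads[e] == head]


# Hypothetical data: one author related to a department at two moments.
g = PropertyGraph()
author = g.add_node("author", name="Ann")
dept = g.add_node("department", name="Computer Architecture")
e1 = g.add_edge("member_of", author, dept, year=2005)
e2 = g.add_edge("member_of", author, dept, year=2012)
```

Because every edge carries its own identifier, the two `member_of` edges coexist and keep separate attributes, which is what the multigraph requirement asks for.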

14 What do we want from graphs? What are the most common queries that we need to perform on graphs? What kind of processes must be carried out to perform these queries? Can we classify these queries? Is there a set of common underlying operations for all these queries?

15 Example: Usual Operations on Social Networks Common Social Network ops Neighbors Shortest Path Degree K-hops Connected Components Closeness Betweenness Bridging

18 Example: Usual Operations on Social Networks: Out-degree. Common Social Network ops: Neighbors, Shortest Path, Degree, K-hops, Connected Components, Closeness, Betweenness, Bridging. (Figure: example graph annotated with the out-degree of each node.)

19 Example: Usual Queries on Social Networks: In-degree. Common Social Network ops: Neighbors, Shortest Path, Degree, K-hops, Connected Components, Closeness, Betweenness, Bridging. (Figure: example graph annotated with the in-degree of each node.)

20 Example: Usual Queries on Social Networks 1-hop Common Social Network ops Neighbors Shortest Path Degree K-hops Connected Components Closeness Betweenness Bridging

21 Example: Usual Queries on Social Networks 2-hop Common Social Network ops Neighbors Shortest Path Degree K-hops Connected Components Closeness Betweenness Bridging

22 Example: Usual Operations on Social Networks 3-hop Common Social Network ops Neighbors Shortest Path Degree K-hops Connected Components Closeness Betweenness Bridging

23 Example: Usual Operations on Social Networks 4-hop Common Social Network ops Neighbors Shortest Path Degree K-hops Connected Components Closeness Betweenness Bridging
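The 1-hop to 4-hop slides above correspond to a level-by-level breadth-first expansion. A minimal sketch, assuming a directed graph stored as a dictionary of successor lists (the toy graph and the name `k_hop` are illustrative):

```python
def k_hop(adj, source, k):
    """Return {hop: vertices first reached at that hop} for hops 1..k."""
    visited = {source}
    frontier = {source}
    levels = {}
    for hop in range(1, k + 1):
        # Expand the current frontier by one edge, skipping known vertices.
        frontier = {v for u in frontier for v in adj.get(u, ()) if v not in visited}
        if not frontier:
            break
        visited |= frontier
        levels[hop] = frontier
    return levels


# Hypothetical toy graph
adj = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
two_hops = k_hop(adj, 1, 2)    # {1: {2, 3}, 2: {4}}
four_hops = k_hop(adj, 1, 4)   # stops early: nothing new after hop 3
```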

25 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

26 The graph processing stack Applications: "I would like to determine who could be a good reviewer for a paper." High level operations: "I need to compute, among others, the difference between two social graphs." Low level operations: "I need to iterate over nodes, get neighbours through edge navigation, etc."

27 The stack: tier 1, applications Example: Bibliography setting Who can be a good reviewer for a paper? Analyse paper submitted Find authorities for most important topics in the paper. Find social network of authors and eliminate them from the authorities.

28 The stack: tier 2, high level ops. High level operations. Analyse paper submitted: text analysis. Find authorities for most important topics in the paper: PageRank-like algorithm if citations are included in the graph. Find social network of authors and eliminate them from the authorities: two-hop algorithm (1st hop my papers, 2nd hop my coauthors), then graph subtraction.
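The last step (two-hop expansion followed by subtraction) can be sketched as below. This is a simplification under stated assumptions: the authority ranking is taken as given, "graph subtraction" is reduced to a set difference on author nodes, and all names and data are hypothetical.

```python
def coauthor_network(author, papers_of, authors_of):
    """Two-hop expansion: author -> own papers -> coauthors (author included)."""
    network = {author}
    for paper in papers_of.get(author, ()):          # 1st hop: my papers
        network |= set(authors_of.get(paper, ()))    # 2nd hop: my coauthors
    return network


def candidate_reviewers(paper_authors, authorities, papers_of, authors_of):
    """Drop from the ranked authorities anyone in the authors' social network."""
    excluded = set()
    for author in paper_authors:
        excluded |= coauthor_network(author, papers_of, authors_of)
    return [a for a in authorities if a not in excluded]


# Hypothetical bibliography
papers_of = {"ana": ["p1"], "bob": ["p1", "p2"], "carl": ["p2"], "dana": ["p3"]}
authors_of = {"p1": ["ana", "bob"], "p2": ["bob", "carl"], "p3": ["dana"]}
ranked_authorities = ["bob", "carl", "dana"]   # assumed output of the ranking step

reviewers = candidate_reviewers(["ana"], ranked_authorities, papers_of, authors_of)
```

Here `bob` is removed because he coauthored `p1` with the submitting author, while `carl` and `dana` remain eligible.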

29 The stack: tier 3, low level ops. Basic analysis: Get node/edge Get attributes from a node or an edge Get neighbors Node degree Basic transformations Add/delete node/edge Add/delete/update attribute

30 Summary of graph operations D. Dominguez-Sal, N. Martinez-Bazan, V. Muntés-Mulero, P. Baleta, and J-Ll. Larriba-Pey. A discussion on the design of graph database benchmarks. In Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems (TPCTC'10). Springer-Verlag, Berlin, Heidelberg,

31 Initial Conclusions Graphs must be in general Attributed, labeled (types), directed, multigraph Operations on graphs Many operations require to access the whole graph Significant set of cascaded operations Many queries access both the structure and attributes Candidate scenarios to benefit from this technology: any scenario where: Large graph datasets, variety of operations, industrial interest

33 Sparksee, high performance technology Norbert Martínez-Bazán, Arnau Prat Pérez, David Dominguez Sal, and Josep Lluis Larriba Pey et al. DAMA-UPC/Sparsity Technologies N. Martínez-Bazan et al: Dex: high-performance exploration on large graphs for information retrieval. CIKM 2007: N. Martínez-Bazan et al: Efficient graph management based on bitmap indices. IDEAS 2012: R. Angles, A. Prat-Pérez, D. Dominguez-Sal, J.-L. Larriba Pey: Benchmarking database systems for social network applications. GRADES 2013: 15

34 Classical Graph Representation Adjacency matrix Good: Static, good locality for dense graphs (not common), bit based representation Bad: Prohibitively expensive size, neighbors performance, add vertex

35 Classical Graph Representation Adjacency list Good: size for sparse graphs, neighbors performance Bad: Slower edge exists between two vertices, dynamic memory
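The trade-offs of the two classical representations show up directly in the cost of each operation. A minimal sketch for directed graphs (class names are illustrative; the matrix rows use Python integers as bit vectors):

```python
class AdjacencyMatrix:
    def __init__(self, n):
        self.n = n
        self.rows = [0] * n          # row u is an integer used as a bit vector

    def add_edge(self, u, v):
        self.rows[u] |= 1 << v

    def has_edge(self, u, v):        # constant-time bit test
        return bool((self.rows[u] >> v) & 1)

    def neighbors(self, u):          # scans the whole row: O(n) even if sparse
        return [v for v in range(self.n) if (self.rows[u] >> v) & 1]


class AdjacencyList:
    def __init__(self):
        self.adj = {}

    def add_edge(self, u, v):        # vertices are created on demand
        self.adj.setdefault(u, []).append(v)
        self.adj.setdefault(v, [])

    def has_edge(self, u, v):        # scans the list of u: O(degree(u))
        return v in self.adj.get(u, [])

    def neighbors(self, u):          # proportional to the degree only
        return list(self.adj.get(u, []))


matrix = AdjacencyMatrix(4)
lists = AdjacencyList()
for u, v in [(0, 1), (0, 2), (2, 3)]:
    matrix.add_edge(u, v)
    lists.add_edge(u, v)
```

Note that the matrix needs its size up front (adding a vertex means resizing every row), whereas the list grows dynamically.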

36 Classical Graph Representation Several limitations No labels Node and edge attributes Multigraphs Good for in memory

37 Relational database Implementation: Store each type of node in a table with its associated attributes. Store each type of edge in a table with its associated attributes. Navigation through edges is resolved by join operations. Query Language: SQL. Not suitable for path traversals and graph exploration. Eg: DB2, Oracle, SQLServer, MySQL, Postgres.

38 Resource Description Framework Resource Description Framework (RDF). Triple format: Subject, Predicate, Object. All data is represented as triples; queries are patterns on such triples. Edge: a triple where subject and object are nodes. Attribute: a triple where the subject is a node and the object is a literal value. Query language: SPARQL. Not a natural representation. Eg: Virtuoso, RDF-3X, Sesame; RDBMSs also start to support RDF (DB2, Oracle).
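To make "queries are patterns on such triples" concrete, here is a toy in-memory triple store with wildcard matching. It is only a sketch of the idea, not SPARQL and not the API of any RDF engine; the data and the `match` helper are hypothetical.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


triples = [
    ("paper1", "cites", "paper2"),        # edge: subject and object are nodes
    ("paper3", "cites", "paper2"),        # edge
    ("paper1", "title", "Graph DBs"),     # attribute: object is a literal
    ("paper2", "title", "Bitmaps"),       # attribute
]

# Which papers cite paper2?
citing = [s for s, _, _ in match(triples, p="cites", o="paper2")]

# Two-step pattern: titles of the papers cited by paper1.
cited_titles = [title
                for _, _, cited in match(triples, s="paper1", p="cites")
                for _, _, title in match(triples, s=cited, p="title")]
```

The second query already needs one pattern match per traversal step, which hints at why long path traversals are awkward in this model.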

39 New approaches Graph Databases: Graph / edge types Specialized graph API or language (eg. Cypher, Gremlin) Out of core functionalities with buffer pool Eg. Sparksee, Neo4j, OrientDB, Hypergraph, Infinitegraph Distributed graph analysis for large datasets Map-reduce (Pegasus) Vertex-centric computation model (Pregel, Giraph, GraphLab )

40 Requirements Data and schema represented as a graph Data operations based on graph operations Graph-based integrity restrictions Multigraphs Attributes attached to both vertices and edges Graph queries combining edge traversals with attribute accesses Diversity of workloads Efficient secondary memory management

41 Sparksee Main features Graph split into small structures Move to main memory just significant parts (caching) Object identifiers (oids) instead of complex objects Reduce memory requirements Specific structures to improve traversals Index the edges and the neighbors of each node Attribute indices Improve queries based on value filters Implemented in C++ Different APIs (Java,.NET, etc.) through wrappers

42 Sparksee Capabilities Efficiency very compact representation using bitmaps. Highly compressible data structures. Capacity more than 100 billion vertices and edges in a single multicore computer. Performance subsecond response in recommendation queries. Scalability high throughput for concurrent queries. Consistency partial transactional support with recovery. Multiplatform Linux, Windows, MacOSX, Mobile

43 Sparksee Architecture (Figure: architecture diagram; the only surviving label is Sparksee Python.)

44 Graph representation We define a graph $G = (V, E, L, T, H, A_1, \ldots, A_p)$ as:
LABELS: $L = \{(o, l) \mid o \in (V \cup E) \wedge l \in \mathrm{string}\}$
TAILS: $T = \{(e, t) \mid e \in E \wedge t \in V\}$
HEADS: $H = \{(e, h) \mid e \in E \wedge h \in V\}$
ATTRIBUTES: $A_i = \{(o, c) \mid o \in (V \cup E) \wedge c \in \{\mathrm{int}, \mathrm{string}, \ldots\}\}$
With this representation: the graph is split into multiple lists of pairs; the first element of each pair is always a vertex or an edge.

45 Graph representation
L: (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS), (e6, CONTAINS), (e7, CONTAINS)
T: (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4)
H: (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6)
A_id: (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2)
A_title: (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona)
A_nlc: (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en)
A_filename: (v5, europe.png), (v6, bcn.jpg)
A_tag: (e4, continent)

46 Value sets Groups all pairs of the original set with the same value as a pair between the value and the set of objects with such value.
L: (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS), (e6, CONTAINS), (e7, CONTAINS) -> (ARTICLE, {v1, v2, v3, v4}), (BABEL, {e1, e2}), (CONTAINS, {e5, e6, e7}), (IMAGE, {v5, v6}), (REF, {e3, e4})
T: (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4) -> (v1, {e1}), (v2, {e2}), (v3, {e5, e6}), (v4, {e3, e4, e7})
H: (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6) -> (v3, {e1, e2, e3, e4}), (v5, {e5}), (v6, {e6, e7})
A_id: (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2) -> (1, {v1, v5}), (2, {v2, v6}), (3, {v3}), (4, {v4})
A_title: (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona) -> (Barcelona, {v4}), (Europa, {v1}), (Europe, {v2, v3})
A_nlc: (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en) -> (ca, {v1}), (en, {v3, v4, e1, e2}), (fr, {v2})
A_filename: (v5, europe.png), (v6, bcn.jpg) -> (bcn.jpg, {v6}), (europe.png, {v5})
A_tag: (e4, continent) -> (continent, {e4})
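The grouping step that turns a list of pairs into a value set is a plain "group by value". A minimal sketch using two of the sets from the example; the function name `value_sets` is an assumption:

```python
from collections import defaultdict


def value_sets(pairs):
    """Group (object, value) pairs into {value: set of objects}."""
    groups = defaultdict(set)
    for obj, value in pairs:
        groups[value].add(obj)
    return dict(groups)


# Pairs taken from the example graph
A_title = [("v1", "Europa"), ("v2", "Europe"), ("v3", "Europe"), ("v4", "Barcelona")]
T = [("e1", "v1"), ("e2", "v2"), ("e3", "v4"), ("e4", "v4"),
     ("e5", "v3"), ("e6", "v3"), ("e7", "v4")]

title_sets = value_sets(A_title)
tail_sets = value_sets(T)   # for each vertex, the edges leaving it
```

Grouping T by value yields, for every vertex, the set of its outgoing edges, so the same structure serves both attribute filters and navigation.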

47 Bitmap Index Each vertex and edge is identified by a unique and immutable oid (object identifier). Each vertex or edge set is stored in a bitmap structure: each position in a bitmap corresponds to the oid of an object. Reduced amount of space (compression techniques). Very efficient binary logic operations. Eg: (ARTICLE, {v1, v2, v3, v4}) -> Article:
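A sketch of why bitmaps make set operations cheap, using a Python integer as an uncompressed bitmap. The oid numbering (v1..v6 mapped to 1..6, e1..e7 mapped to 7..13) is an assumption made for this illustration, as are the helper names.

```python
def to_bitmap(oids):
    """Set bit number `oid` for every object in the set."""
    bitmap = 0
    for oid in oids:
        bitmap |= 1 << oid
    return bitmap


def to_oids(bitmap):
    """Recover the sorted list of oids whose bit is set."""
    oids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            oids.append(position)
        bitmap >>= 1
        position += 1
    return oids


# Assumed oid numbering: v1..v6 -> 1..6, e1..e7 -> 7..13
article = to_bitmap([1, 2, 3, 4])      # (ARTICLE, {v1, v2, v3, v4})
english = to_bitmap([3, 4, 7, 8])      # (en, {v3, v4, e1, e2})

english_articles = to_oids(article & english)   # intersection is one AND
either = to_oids(article | english)             # union is one OR
```

Each set operation becomes a single bitwise instruction per machine word, regardless of how the objects were inserted.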

48 Link: the basic internal structure New way to represent a graph that allows for out-of-core management: Graph is split into smaller structures to favor the caching of significant parts. Object identifiers (OIDs) are used to reduce memory requirements. Specific structures to help in the navigation and traversal of edges. Attributes are fully indexed to allow queries based on filters. (Figure: a Link is built from Maps, relating values and oids, and Bitmaps.)

49 Example of a bitmap based representation

50 Integrity rules

51 Value set operations Domain: returns the set of distinct values. Objects: returns the set of vertices or edges associated to a value. Lookup: returns the set of values associated to a set of objects. Insert/Remove: adds a vertex or edge to the collection, or removes it.

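52 Graph query examples
Number of articles: $|\mathrm{objects}(\mathrm{LABELS}, \text{ARTICLE})|$
Out-degree of English article Europe: $|\mathrm{objects}(\mathrm{TAILS}, \mathrm{objects}(\mathrm{TITLE}, \text{Europe}) \cap \mathrm{objects}(\mathrm{NLC}, \text{en}) \cap \mathrm{objects}(\mathrm{LABELS}, \text{ARTICLE}))|$
Articles with references to the image with filename bcn.jpg: $\{\mathrm{lookup}(\mathrm{TAILS}, x) \mid x \in \mathrm{objects}(\mathrm{HEADS}, \mathrm{objects}(\mathrm{FILENAME}, \text{bcn.jpg}) \cap \mathrm{objects}(\mathrm{LABELS}, \text{IMAGE}))\}$
Count the articles of each language: $\{(x, y) \mid x \in \mathrm{domain}(\mathrm{NLC}) \wedge y = |\mathrm{objects}(\mathrm{NLC}, x) \cap \mathrm{objects}(\mathrm{LABELS}, \text{ARTICLE})|\}$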
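The four queries above can be replayed on the example graph with a small in-memory stand-in for value sets. This is a sketch for checking the expressions by hand, using Python sets instead of bitmaps; the `ValueSet` class and the convention that `objects` also accepts a set of values are assumptions of the example, not the Sparksee API.

```python
from collections import defaultdict


class ValueSet:
    def __init__(self, pairs):
        self.value_of = dict(pairs)                 # object -> value
        self.objects_of = defaultdict(set)          # value -> objects
        for obj, value in pairs:
            self.objects_of[value].add(obj)

    def domain(self):
        return set(self.objects_of)

    def objects(self, values):
        """Objects associated to a value, or to any value of a set."""
        if not isinstance(values, (set, frozenset)):
            values = {values}
        result = set()
        for value in values:
            result |= self.objects_of.get(value, set())
        return result

    def lookup(self, objs):
        return {self.value_of[o] for o in objs if o in self.value_of}


LABELS = ValueSet([("v1", "ARTICLE"), ("v2", "ARTICLE"), ("v3", "ARTICLE"),
                   ("v4", "ARTICLE"), ("v5", "IMAGE"), ("v6", "IMAGE"),
                   ("e1", "BABEL"), ("e2", "BABEL"), ("e3", "REF"), ("e4", "REF"),
                   ("e5", "CONTAINS"), ("e6", "CONTAINS"), ("e7", "CONTAINS")])
TAILS = ValueSet([("e1", "v1"), ("e2", "v2"), ("e3", "v4"), ("e4", "v4"),
                  ("e5", "v3"), ("e6", "v3"), ("e7", "v4")])
HEADS = ValueSet([("e1", "v3"), ("e2", "v3"), ("e3", "v3"), ("e4", "v3"),
                  ("e5", "v5"), ("e6", "v6"), ("e7", "v6")])
TITLE = ValueSet([("v1", "Europa"), ("v2", "Europe"), ("v3", "Europe"),
                  ("v4", "Barcelona")])
NLC = ValueSet([("v1", "ca"), ("v2", "fr"), ("v3", "en"), ("v4", "en"),
                ("e1", "en"), ("e2", "en")])
FILENAME = ValueSet([("v5", "europe.png"), ("v6", "bcn.jpg")])

articles = LABELS.objects("ARTICLE")
n_articles = len(articles)                                        # 4

europe_en = TITLE.objects("Europe") & NLC.objects("en") & articles
out_degree = len(TAILS.objects(europe_en))                        # edges leaving v3

image = FILENAME.objects("bcn.jpg") & LABELS.objects("IMAGE")
referencing = TAILS.lookup(HEADS.objects(image))                  # tails of edges into v6

per_language = {x: len(NLC.objects(x) & articles) for x in NLC.domain()}
```

Every query is a short chain of `objects`/`lookup` calls combined with set intersections, which is exactly what the bitmap representation is designed to evaluate quickly.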

53 Implementation details Bitmaps are compressed by grouping the bits into clusters of 32 consecutive bits (up to 137 billion objects per graph). Locality is improved by generating consecutive oids for each distinct vertex or edge label. Sorted tree structure of bitmap clusters to speed up the insert, remove, and binary logic operations. Maps are implemented using B+ trees. The tail, head and attribute value sets have been split into specific value sets for each label.
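The clustering idea can be sketched as a sparse bitmap that keeps one 32-bit word per non-empty cluster. Two simplifications are assumed here: a Python dictionary stands in for the sorted tree of clusters, and the class and method names are hypothetical.

```python
CLUSTER_BITS = 32


class ClusteredBitmap:
    def __init__(self):
        self.clusters = {}   # cluster index -> 32-bit word; empty clusters are absent

    def add(self, oid):
        index, offset = divmod(oid, CLUSTER_BITS)
        self.clusters[index] = self.clusters.get(index, 0) | (1 << offset)

    def remove(self, oid):
        index, offset = divmod(oid, CLUSTER_BITS)
        word = self.clusters.get(index, 0) & ~(1 << offset)
        if word:
            self.clusters[index] = word
        else:
            self.clusters.pop(index, None)   # drop clusters that became empty

    def __contains__(self, oid):
        index, offset = divmod(oid, CLUSTER_BITS)
        return bool((self.clusters.get(index, 0) >> offset) & 1)

    def intersect(self, other):
        """Bitwise AND, touching only clusters present in both bitmaps."""
        result = ClusteredBitmap()
        for index in self.clusters.keys() & other.clusters.keys():
            word = self.clusters[index] & other.clusters[index]
            if word:
                result.clusters[index] = word
        return result


a = ClusteredBitmap()
b = ClusteredBitmap()
for oid in (5, 6, 1_000_000):
    a.add(oid)
for oid in (6, 7, 2_000_000):
    b.add(oid)
both = a.intersect(b)

max_objects = 2 ** 37   # 37-bit identifiers give about 137 billion objects
```

Three oids spread over a range of a million occupy only two stored words in `a`, which is the space saving the compression relies on; giving consecutive oids to objects of the same label keeps those words dense.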

54 Evaluation (GRADES 13) Objective: understand how small queries stress different DBs: Graph, RDF, Relational. Data schema based on social network data: Person (ID, name, location, age), WebPage (ID, URL, creation), Like (Person to WebPage), Friend (Person to Person). Graph data generator based on R-Mat. In collaboration with R. Angles, U. Talca

55 Query set Selection (Q1) Get all the persons having a name N (Q4) Get the name of the person with a given PID Adjacency (Q2) Get all the persons who like a given webpage W (Q3) Get the webpages that person P likes Pattern matching (Q10) Get the common friends between persons P1 and P2 (Q11) Get the common webpages that persons P1 and P2 like Fixed-length path (Q5) Get the friends of the friends of a given person P (Q6) Get the webpages liked by the friends of a given person P (Q7) Get persons that like a webpage which a person P likes Reachability (Q8) Is there a friend connection between two persons? (Q9) Get the shortest path between two persons Summarization (Q12) Get the number of friends of a person P
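A few of the queries above, written against plain in-memory adjacency sets to show what each category asks of the engine. The sample friendship data and function names are hypothetical, and Q5 is read here as "everyone reachable through a friend, except P itself".

```python
from collections import deque

# Hypothetical symmetric Friend relation
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
    "eve": set(),
}


def common_friends(p1, p2):                  # Q10, pattern matching
    return friends[p1] & friends[p2]


def friends_of_friends(p):                   # Q5, fixed-length path
    return set().union(*(friends[f] for f in friends[p])) - {p}


def connected(p1, p2):                       # Q8, reachability via BFS
    seen = {p1}
    queue = deque([p1])
    while queue:
        person = queue.popleft()
        if person == p2:
            return True
        for friend in friends[person] - seen:
            seen.add(friend)
            queue.append(friend)
    return False


def number_of_friends(p):                    # Q12, summarization
    return len(friends[p])
```

Q10 and Q12 touch only the direct neighborhood, Q5 needs two expansion steps, and Q8 may visit the whole connected component, which is why the categories stress systems so differently.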

56 Experiments: systems
DB System | DB Type | Implementation | Query language
Dex (v4.7) | Graph | API | -
Neo4j (v1.8.2) | Graph | API | Cypher
RDF-3X (v0.3.7) | RDF | Java driver | SPARQL
Virtuoso (v7.0) | RDF / Column store | Java driver, Stored procedures | SQL + Extension, Virtuoso/PL
PostgreSQL (v9.1) | Row based | Java driver, Stored procedures | SQL, PL/PgSQL
Test environment. Hardware: Intel Xeon E GHz, 32 GB RAM, 1TB HD. Software: Linux Debian amd64 kernel, ext3 file system.

57 Experiments: query execution test

59 On Demand Memory Specialization for Distributed Graph Databases Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep-Lluis Larriba-Pey Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep- Lluis Larriba-Pey: On Demand Memory Specialization for Distributed Graph Databases. CoRR abs/ (2013)

60 Motivation Large Graphs: Need to be stored and processed in parallel/distributed computers Difficult to partition Partition functions Favour specific operations Have to be executed for different environments Are costly to execute

61 Contributions System design in two levels: Physical storage, Memory management. Data access pattern monitoring. Specific data structure. Load and network balancing. Increased throughput.

62 System Overview Memory management Storage

63 Partition Manager We propose a new data structure. Monitors data access patterns. Uses this information in a simple way to decide how to route queries. Matrix of data access sequences. New compressed data structure.

65 Experiments Scalability with cluster size Tested up to 32 machines Systems compared Static partitioning Dynamic partitioning (ours) R-MAT graph 37M vertices 1B edges Queries: BFS and k-hops

66 Experiments: BFS Throughput (higher is better) Load Imbalance (lower is better)

67 Experiments

68 Experiments Average Response Time (lower is better)

70 Shaping Communities out of Triangles Arnau Prat Pérez, David Dominguez Sal, Josep Maria Brunat and Josep Lluis Larriba Pey DAMA-UPC A. Prat-Pérez, D. Dominguez-Sal, J.L. Larriba-Pey: High quality, scalable and parallel community detection for large real graphs. WWW 2014: A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, J.L. Larriba-Pey: Shaping communities out of triangles. CIKM 2012:

71 Community detection Community Informal definition set of nodes well connected among them, but not with other nodes No agreement on a definition

72 State of the art Community Detection In general, state of the art metrics perform well: Modularity, Conductance. However, under certain circumstances, they fail: treelike structures, clique chains. Algorithms based on those metrics fail to provide quality and performance. Reason: they focus on edges but ignore the internal structure.

73 Our goal Find algorithms with a strong focus on: Quality, Scalability, Parallelism. First, propose a metric: Weighted Community Clustering (WCC). Then, propose an algorithm: Scalable Community Detection (SCD). First proposal for a scalable disjoint community detection algorithm for SMP architectures. Undirected graph without attributes.

74 Weighted Community Clustering We take triangles as the key factor of a community structure.

75 Weighted Community Clustering $t(x,S)$ is the number of triangles that vertex $x$ closes with vertices in a set $S$. $vt(x,S)$ is the number of vertices of a set $S$ that form at least one triangle with $x$. The WCC of a set $S$ is the average WCC of its vertices.
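The two counting functions can be sketched directly from their definitions. The WCC formula itself is not reproduced on the slide, so it is not implemented here. One assumption is made and marked in the code: for $vt(x,S)$ the third vertex of the triangle may be any vertex of the graph, not necessarily a member of $S$. The toy graph is hypothetical.

```python
from itertools import combinations


def t(adj, x, S):
    """Number of triangles that x closes with two vertices of S."""
    candidates = (adj[x] & S) - {x}
    return sum(1 for u, w in combinations(candidates, 2) if w in adj[u])


def vt(adj, x, S):
    """Number of vertices of S that form at least one triangle with x.

    Assumption: the third vertex of the triangle can be anywhere in the graph.
    """
    neighbors = adj[x] - {x}
    return sum(1 for u in (neighbors & S) if adj[u] & (neighbors - {u}))


# Hypothetical undirected graph: triangle a-b-c plus a pendant vertex d on c
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
everyone = set(adj)
community = {"a", "b", "c"}
```

For vertex `c`, the pendant neighbor `d` closes no triangle, so it contributes to neither count; this is the structural information that edge-counting metrics ignore.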

76 Properties Maximizing WCC, fulfils minimum properties: Internal Structure, triangles. Linear Community Cohesion, number of connections of a node Bridges, no bridges secured. Cut Vertex Density, minimum density secured.

77 Algorithm overview

78 Experimental setup Real graphs with ground truth communities. Average F1Score and NMI to measure quality. Baseline with disjoint and overlapping algorithms: Walktrap, Infomap, Louvain, Bigclam and Oslom. Intel Xeon Quad Core 2.4 GHz with 32 GB of RAM.
Graph | Nodes | Edges
Amazon | 0.3 M | 0.9 M
dblp | 0.3 M | 1.0 M
Youtube | 1.1 M | 2.9 M
Livejournal | 3.9 M | 34.6 M
Orkut | 3 M | M
Friendster | 65 M | 1806 M

79 Execution Time All executions are single threaded. Executions longer than one week were aborted.

80 F1Score

81 Complexity Friendster graph in 4.3 hours! Intel Xeon Quad Core 2.4 Ghz with 32 GB of RAM.

82 Parallelism Parallelization with OpenMP.

84 Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey: Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval. ICMR 2014: 33

85 Introduction Quality depends on the user's skills. Short keyword-based queries. Not an exact, correct and complete set of keywords.

86 Query Expansion Process of transforming Q O into Q E Detect expansion features (terms/phrases) What kind of expansion features? How to obtain the expansion features?

87 Overview

88 Path detection θ=colored Volkswagen beetles κ=volkswagen beetles in any color for example, red, blue, green or yellow. Volkswagen volkswagen beetle volkswagen fox volkswagen beetle volkswagen passat volkswagen beetle volkswagen type 2 volkswagen beetle volkswagen golf volkswagen beetle volkswagen jetta volkswagen beetle volkswagen touareg volkswagen beetle volkswagen golf mk4 volkswagen beetle volkswagen beetle volkswagen transporter

89 Build communities Build communities around paths Use paths as seeds Communities retrieve the closest related concepts to the path May not be directly connected to the concept

90 Example Query Colored volkswagen beetles From Topological Expansion: Volkswagen beetle, Volkswagen fusca, VW type 1, Volkswagen 1200, Volkswagen bug, Volkswagen super bug, VW bug, VW Käfer, Volkswagen new Beetle, Baja bug, Volkswagen group, VW group, H-Shifter, engine, car, automobile,. From Redirect-based Expansion: VW Beetles, VW Beetle, VW Käfer, Volkswagon Beetles, Volkswagon Beetle, Volkswagon Käfer, Volkswagen Beetles, Volkswagen Beetle, Volkswagen Käfer Hundreds of expansion terms

91 Results for the query "Colored volkswagen beetles": enrichment vs. no enrichment (result images not preserved in the transcription)

93 LDBC Social Network Benchmark

94 Why benchmarking Two main objectives: Allow final users to assess the performance of the software they want to buy Push the technology to the limit to allow for progress Main effort in DB benchmarking up to now TPC: Transaction Processing Performance Council Relational DBs: Transactional and DSS

95 What is LDBC? Non-profit organization formed in 2012 from a European Union Project. Defines and develops benchmarks for graph-like data management technologies: Graph Database Management Systems (GDBMS), RDF Systems, Graph Processing Frameworks. LDBC provides: all the documentation (workload definitions, running instructions, disclosure guidelines) of the benchmarks it develops; all the necessary data and software to run the benchmarks.

96 What is LDBC? Participated by principal actors in graph data management and RDF:

97 Objectives: Benchmarks for the emerging field of RDF and Graph database management systems (GDBs) Spur industry cooperation around benchmarks LDBC foundation created and operative on 2Q Benchmarks created: Semantic Publishing Benchmark (SPB) Social Network Benchmark (SNB) Web site: Software repositories in Github

98 What is LDBC? Currently developing two benchmarks: Social Network Benchmark (LDBC-SNB): for testing the performance of graph databases and graph processing frameworks, inspired by the management of a social network. Semantic Publishing Benchmark (LDBC-SPB): for testing the performance of RDF engines, inspired by the Media/Publishing industry. For more information, visit:

99 The Social Network Benchmark LDBC-SNB[1] Philosophy: Rich coverage Modularity Small implementation cost Relevance Reproducibility Open source, you may do whatever you want with it!!! Mimics the operation of a real social network: Simple to understand Allows the testing of a complete range of interesting challenges Can be easily scaled [1]

100 The SNB workloads Interactive: Interactive queries representing the interaction of the users with the social network Low latency and multiple concurrent users Small data accessed 14 read and 8 update Business Intelligence: Complex structured queries for analyzing online behavior of users for marketing purposes Non interactive queries Moderate data accessed 22 queries Graph Analytics: Expensive graph analytical queries (PageRank, Centrality, Clustering ) Large data accessed Not defined yet Being designed

101 LDBC-SNB

102 DATA SCHEMA

103 INTERACTIVE Consists of 14 read queries and 8 update queries
Read Query | High level description
Read Query 1 | Friends with a certain name
Read Query 2 | Recent posts and comments by your Friends
Read Query 3 | Friends and Friends of Friends that have been in countries X and Y
Read Query 4 | New topics
Read Query 5 | New groups
Read Query 6 | Tag co-occurrence
Read Query 7 | Recent likes

104 INTERACTIVE
Read Query | High level description
Read Query 8 | Recent replies
Read Query 9 | Recent posts and comments by Friends and Friends of your Friends
Read Query 10 | Friend recommendation
Read Query 11 | Job referral
Read Query 12 | Expert search
Read Query 13 | Single shortest path
Read Query 14 | Weighted paths

105 INTERACTIVE
Update Query | High level description
Update Query 1 | Add Person
Update Query 2 | Add Friendship
Update Query 3 | Add Forum
Update Query 4 | Add Forum Membership
Update Query 5 | Add Post
Update Query 6 | Add Like to Post
Update Query 7 | Add Comment
Update Query 8 | Add Like to Comment

106 BUSINESS INTELLIGENCE Consists of 22 read queries
Query | High level description
Query 1 | Post stats
Query 2 | Volume in forums on a subject in a topic
Query 3 | The locally going thing
Query 4 | Thread length distribution
Query 5 | Best publicist
Query 6 | Branding hour
Query 7 | Market share
Query 8 | Cross-border conversation

107 BUSINESS INTELLIGENCE
Query | High level description
Query 9 | Person with most posts in foreign language
Query 10 | People who studied in the same university
Query 11 | Teenagers talking to strangers
Query 12 | Liker clique
Query 13 | Data quality
Query 14 | Find unusual values
Query 15 | Development of mindshare
Query 16 | Biggest posters on a tag

108 BUSINESS INTELLIGENCE
Query | High level description
Query 17 | Country bias
Query 18 | Posting activity
Query 19 | Hub Countries
Query 20 | Moving predicates
Query 21 | Mole hunt
Query 22 | Concept promoter

109 DATAGEN Datasets are synthetically generated using DATAGEN [1]: Realistic, Scalable, Deterministic, Usable. Datasets simulate a social network's activity during a period of time. Uses Hadoop (MapReduce) to scale to millions of entities. [1]

110 DATAGEN 16 dictionaries extracted from DBpedia are used to produce correlated attributes, e.g. Names by Country, Tags by Country, Companies by Country, etc. Reproduces realistic distributions found in real social networks, e.g. the friendship degree distribution mimics that found in Facebook. Reproduces the homophily principle, i.e. similar people tend to be connected. Generates substitution parameters for the queries of the workloads. Implemented with Hadoop to generate huge datasets.

111 SCALE FACTORS We define a set of scale factors: SF1, SF3, SF10, SF1000. Different scale factors target systems of different sizes and characteristics, from single node machines to large clusters. Each scale factor represents a data set of a different size, in gigabytes. (Table: SF, Persons, Activity in years, Size; only the sizes 1GB, 10GB and 100GB survive in the transcription.)

112 DATAGEN degree dist. (Figure: friendship degree distribution, Facebook [1] vs. Datagen SF1.) [1] -data-team/anatomy-of-

113 LDBC Execution Driver We provide the SNB execution driver [1]: Easy to extend. Small implementation cost. Multiple concurrent threads. Each query type is assigned an interleave interval: the time between issuing two query instances of the same type. Queries vary in complexity, so we must ensure that no query type dominates execution time. Interleave intervals are extracted experimentally. Tracks dependencies between update queries automatically, e.g. we cannot insert a post before its author is inserted. Automatically gathers and reports performance metrics. [1]
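The two mechanisms described above, interleave intervals and dependency tracking, can be illustrated with a toy scheduler. This is not the LDBC driver or its API: the function names, the millisecond intervals and the update records are all hypothetical, and a real driver would issue queries concurrently rather than just list them.

```python
import heapq


def schedule(interleave_ms, horizon_ms):
    """Return (time, query_type) issue events up to horizon_ms, ordered by time.

    interleave_ms maps each query type to the gap between two of its instances.
    """
    heap = [(interval, qtype) for qtype, interval in interleave_ms.items()]
    heapq.heapify(heap)
    events = []
    while heap and heap[0][0] <= horizon_ms:
        time, qtype = heapq.heappop(heap)
        events.append((time, qtype))
        heapq.heappush(heap, (time + interleave_ms[qtype], qtype))
    return events


def runnable(update, completed):
    """An update may run only after every update it depends on has completed."""
    return all(dep in completed for dep in update["depends_on"])


# Hypothetical intervals: the cheap query is issued more often than the costly one.
events = schedule({"Q1": 100, "Q2": 250}, horizon_ms=300)

add_person = {"id": "person:1", "depends_on": []}
add_post = {"id": "post:7", "depends_on": ["person:1"]}   # the post needs its author
```

Giving the expensive query type a longer interval is what keeps it from dominating execution time, and the dependency check is what prevents inserting a post before its author exists.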

114 Conclusions LDBC defines and develops benchmarks for graph-like data management: relevant and easy to adopt. Two benchmarks: SNB, with three different workloads for different types of graph management systems, and SPB, for RDF engines. LDBC-SNB provides a realistic social network data generator and an execution driver. All software is open source.

115 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions

116 Conclusions Graph management is becoming important. There are different applications where traversals, pattern matching and complex graph operations are needed. Performance is an issue. Building high performance systems: Sparksee, a vertical store based on bitmap processing; distributed management of graphs. Community search: it is important to take structure into account. LDBC provides benchmarks relevant for industry, in two flavours, SPB and SNB, and is open to everybody to use, explore, and propose new ideas in benchmarking.

117 Any questions? Contact DAMA Group Web Site:

118 Sparksee evaluation (IDEAS) Additional material

119 Commercial Graph DBMSs

120 DEX: Some internal details All the structures presented have been implemented in the current version of the DEX core engine. Support for 37-bit unsigned integer nids and eids, more than 137 billion objects per graph. Identifiers are clustered in groups for each node or edge type. Bitmaps are compressed by grouping the bits into clusters of 32 consecutive bits; only clusters with at least one bit set are stored. Maps are implemented using B+ trees.
N. Martínez-Bazán, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M. Sánchez-Martínez, and J.Ll. Larriba-Pey. DEX: High Performance Exploration on Large Graphs for Information Retrieval. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM), Lisbon, pages ,
N. Martínez-Bazán, S. Gómez-Villamor, V. Muntés-Mulero and J.Ll. Larriba-Pey. Procedure to represent and manipulate multigraphs based on the use of bitmaps. Spanish Patent and Trademark Office, Patent Pending #40444, 22/July/2008.
N. Martínez-Bazán. DEX: High-Performance Graph Databases. Master Thesis in Computer Architecture and Network Systems, Universitat Politècnica de Catalunya, Barcelona, September/2008.
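The bitmap compression described on this slide (32-bit clusters, stored only when non-empty) can be sketched as follows. This is an illustration and not the DEX engine's code: the `SparseBitmap` class and its method names are hypothetical, and a Python dict stands in for the B+ tree maps mentioned on the slide.

```python
CLUSTER_BITS = 32   # bits per cluster, as stated on the slide
MAX_OID = 2**37     # 37-bit identifiers: 137,438,953,472 possible values


class SparseBitmap:
    """Set of object identifiers stored as non-empty 32-bit clusters."""

    def __init__(self):
        # cluster index -> 32-bit word; absent clusters are all zeros
        self.clusters = {}

    def set(self, oid):
        idx, off = divmod(oid, CLUSTER_BITS)
        self.clusters[idx] = self.clusters.get(idx, 0) | (1 << off)

    def clear(self, oid):
        idx, off = divmod(oid, CLUSTER_BITS)
        word = self.clusters.get(idx, 0) & ~(1 << off)
        if word:
            self.clusters[idx] = word
        else:
            # Drop the cluster once its last bit is cleared.
            self.clusters.pop(idx, None)

    def test(self, oid):
        idx, off = divmod(oid, CLUSTER_BITS)
        return bool((self.clusters.get(idx, 0) >> off) & 1)

    def intersect(self, other):
        """Bitwise AND; only clusters present in both operands can survive."""
        result = SparseBitmap()
        for idx, word in self.clusters.items():
            both = word & other.clusters.get(idx, 0)
            if both:
                result.clusters[idx] = both
        return result


a = SparseBitmap()
for oid in (5, 70, MAX_OID - 1):
    a.set(oid)
b = SparseBitmap()
for oid in (5, 71):
    b.set(oid)
common = a.intersect(b)
```

In this sketch three identifiers spread over a 37-bit space cost three stored words rather than a bitmap of 2^37 bits, and set operations only visit the clusters that exist.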

121 Performance and Memory Usage
Benchmark: Wikipedia from 254 different languages with 57 million articles, 2.1 million images, more than 321 million links, and 483 million attribute values (17 GB).
Columns: Query, Time (s), Traversals, Bitmaps (MB), Mem. usage (MB)
Q1 large SPT (BFS): find the article with the largest outdegree and find a shortest-path tree (SPT) only considering the edges of type REF
Q2 top-k 2-hop: recommend related articles to the most popular one
Q3 pattern matching: find new images for articles in other different languages
Q4 group by & count: find, for each language, the number of articles and the number of images referenced by those articles without repetition
Q5 update 2.1M: for each article, materialize an attribute indicating the number of images contained, only if it contains more than one image
Q6 delete ~90%: remove all the articles without any image
Experiments performed using a computer with two quad-core Intel(R) Xeon(R) E5440 at 2.83 GHz. The memory hierarchy is 6144 KB second level cache, 64 GB main memory, and a disk with 1.7 TB. The operating system is Linux Debian etch 4.0

122 Analysis of the Distribution of Bitmaps 99.97% of the bitmaps are smaller than 250 bytes and occupy only 32% of data

123 Analysis of Bitmap Usage

124 Comparison with Other Approaches
Systems: In-memory, MonetDB, Neo4j, DEX
Rows: Data (GB), Load time (hours), Q1 large SPT (BFS), Q2 top-k 2-hop, Q3 pattern matching, Q4 group by & count, Q5 update 2.1M, Q6 delete ~90% > 1 week

125 Scalability

126 Scalability

127 SPB Benchmark - Sparksee on RDF Additional material

128 Schema

129 Choke points
1. CP1 Join ordering
2. CP2 Aggregation
3. CP3 Optional and nested optional clauses
4. CP4 Reasoning
5. CP5 Parallel execution of unions
6. CP6 Optional with filters
7. CP7 Ordering
8. CP8 Geo-spatial predicates
9. CP9 Full text search
10. CP10 Duplicate elimination
11. CP11 Complex filter conditions

130 Queries (I)

131 Queries (II)

132 Characteristics of SPB queries

133 Results

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015 E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing

More information

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward

More information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

More information

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) ! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and

More information

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *

More information

Microblogging Queries on Graph Databases: An Introspection

Microblogging Queries on Graph Databases: An Introspection Microblogging Queries on Graph Databases: An Introspection ABSTRACT Oshini Goonetilleke RMIT University, Australia [email protected] Timos Sellis RMIT University, Australia [email protected]

More information

Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1

Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1 Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012 Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships

More information

Graph Databases. Prad Nelluru, Bharat Naik, Evan Liu, Bon Koo

Graph Databases. Prad Nelluru, Bharat Naik, Evan Liu, Bon Koo Graph Databases Prad Nelluru, Bharat Naik, Evan Liu, Bon Koo 1 Why are graphs important? Modeling chemical and biological data Social networks The web Hierarchical data 2 What is a graph database? A database

More information

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures In-memory Semantic Database Formulation

More information

Review of Graph Databases for Big Data Dynamic Entity Scoring

Review of Graph Databases for Big Data Dynamic Entity Scoring Review of Graph Databases for Big Data Dynamic Entity Scoring M. X. Labute, M. J. Dombroski May 16, 2014 Disclaimer This document was prepared as an account of work sponsored by an agency of the United

More information

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world Analytics March 2015 White paper Why NoSQL? Your database options in the new non-relational world 2 Why NoSQL? Contents 2 New types of apps are generating new types of data 2 A brief history of NoSQL 3

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc Big Data, Fast Data, Complex Data Jans Aasman Franz Inc Private, founded 1984 AI, Semantic Technology, professional services Now in Oakland Franz Inc Who We Are (1 (2 3) (4 5) (6 7) (8 9) (10 11) (12

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction

Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment Fast Iterative Graph Computation with Resource

More information

GRAPH DATABASE SYSTEMS. h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe 2016 1

GRAPH DATABASE SYSTEMS. h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe 2016 1 GRAPH DATABASE SYSTEMS h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe 2016 1 Use Case: Route Finding Source: Neo Technology, Inc. h_da Prof. Dr. Uta Störl Big Data Technologies:

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what

More information

Bigtable is a proven design Underpins 100+ Google services:

Bigtable is a proven design Underpins 100+ Google services: Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Can the Elephants Handle the NoSQL Onslaught?

Can the Elephants Handle the NoSQL Onslaught? Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented

More information

HIGH PERFORMANCE BIG DATA ANALYTICS

HIGH PERFORMANCE BIG DATA ANALYTICS HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning

More information

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB bankmark UG (haftungsbeschränkt) Bahnhofstraße 1 9432 Passau Germany www.bankmark.de [email protected] T +49 851 25 49 49 F +49 851 25 49 499 NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB,

More information

Benchmarking graph databases on the problem of community detection

Benchmarking graph databases on the problem of community detection Benchmarking graph databases on the problem of community detection Sotirios Beis, Symeon Papadopoulos, and Yiannis Kompatsiaris Information Technologies Institute, CERTH, 57001, Thermi, Greece {sotbeis,papadop,ikom}@iti.gr

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce

More information

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.

More information

Systems and Algorithms for Big Data Analytics

Systems and Algorithms for Big Data Analytics Systems and Algorithms for Big Data Analytics YAN, Da Email: [email protected] My Research Graph Data Distributed Graph Processing Spatial Data Spatial Query Processing Uncertain Data Querying & Mining

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics

FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics André Petermann 1,2, Martin Junghanns 1, Robert Müller 2 and Erhard Rahm 1 1 University of Leipzig {petermann,junghanns,rahm}@informatik.uni-leipzig.de

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

An Oracle White Paper July 2011. Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide

An Oracle White Paper July 2011. Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide An Oracle White Paper July 2011 1 Disclaimer The following is intended to outline our general product direction.

More information

White Paper. Optimizing the Performance Of MySQL Cluster

White Paper. Optimizing the Performance Of MySQL Cluster White Paper Optimizing the Performance Of MySQL Cluster Table of Contents Introduction and Background Information... 2 Optimal Applications for MySQL Cluster... 3 Identifying the Performance Issues.....

More information

Deliverable 2.1.4. 150 Billion Triple dataset hosted on the LOD2 Knowledge Store Cluster. LOD2 Creating Knowledge out of Interlinked Data

Deliverable 2.1.4. 150 Billion Triple dataset hosted on the LOD2 Knowledge Store Cluster. LOD2 Creating Knowledge out of Interlinked Data Collaborative Project LOD2 Creating Knowledge out of Interlinked Data Project Number: 257943 Start Date of Project: 01/09/2010 Duration: 48 months Deliverable 2.1.4 150 Billion Triple dataset hosted on

More information

Google Cloud Data Platform & Services. Gregor Hohpe

Google Cloud Data Platform & Services. Gregor Hohpe Google Cloud Data Platform & Services Gregor Hohpe All About Data We Have More of It Internet data more easily available Logs user & system behavior Cheap Storage keep more of it 3 Beyond just Relational

More information

A Comparison of Current Graph Database Models

A Comparison of Current Graph Database Models A Comparison of Current Graph Database Models Renzo Angles Universidad de Talca (Chile) 3rd Int. Workshop on Graph Data Management: Techniques and applications (GDM 2012) 5 April, Washington DC, USA Outline

More information

Business Application Services Testing

Business Application Services Testing Business Application Services Testing Curriculum Structure Course name Duration(days) Express 2 Testing Concept and methodologies 3 Introduction to Performance Testing 3 Web Testing 2 QTP 5 SQL 5 Load

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Real Life Performance of In-Memory Database Systems for BI

Real Life Performance of In-Memory Database Systems for BI D1 Solutions AG a Netcetera Company Real Life Performance of In-Memory Database Systems for BI 10th European TDWI Conference Munich, June 2010 10th European TDWI Conference Munich, June 2010 Authors: Dr.

More information

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014 Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Graph Database Performance: An Oracle Perspective

Graph Database Performance: An Oracle Perspective Graph Database Performance: An Oracle Perspective Xavier Lopez, Ph.D. Senior Director, Product Management 1 Copyright 2012, Oracle and/or its affiliates. All rights reserved. Program Agenda Broad Perspective

More information

Integrating Open Sources and Relational Data with SPARQL

Integrating Open Sources and Relational Data with SPARQL Integrating Open Sources and Relational Data with SPARQL Orri Erling and Ivan Mikhailov OpenLink Software, 10 Burlington Mall Road Suite 265 Burlington, MA 01803 U.S.A, {oerling,imikhailov}@openlinksw.com,

More information

Presto/Blockus: Towards Scalable R Data Analysis

Presto/Blockus: Towards Scalable R Data Analysis /Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew

More information

Performance And Scalability In Oracle9i And SQL Server 2000

Performance And Scalability In Oracle9i And SQL Server 2000 Performance And Scalability In Oracle9i And SQL Server 2000 Presented By : Phathisile Sibanda Supervisor : John Ebden 1 Presentation Overview Project Objectives Motivation -Why performance & Scalability

More information

An empirical comparison of graph databases

An empirical comparison of graph databases An empirical comparison of graph databases Salim Jouili Eura Nova R&D 1435 Mont-Saint-Guibert, Belgium Email: [email protected] Valentin Vansteenberghe Universite Catholique de Louvain 1348 Louvain-La-Neuve,

More information

An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013

An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013 U.S. National Security Agency Research Directorate - R6 Technical Report NSA-RD-2013-056002v1 May 20, 2013 Graphs are everywhere! A graph is a collection of binary relationships, i.e. networks of pairwise

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks

More information

Data warehousing with PostgreSQL

Data warehousing with PostgreSQL Data warehousing with PostgreSQL Gabriele Bartolini http://www.2ndquadrant.it/ European PostgreSQL Day 2009 6 November, ParisTech Telecom, Paris, France Audience

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Performance Analysis of Web based Applications on Single and Multi Core Servers

Performance Analysis of Web based Applications on Single and Multi Core Servers Performance Analysis of Web based Applications on Single and Multi Core Servers Gitika Khare, Diptikant Pathy, Alpana Rajan, Alok Jain, Anil Rawat Raja Ramanna Centre for Advanced Technology Department

More information

LDIF - Linked Data Integration Framework

LDIF - Linked Data Integration Framework LDIF - Linked Data Integration Framework Andreas Schultz 1, Andrea Matteini 2, Robert Isele 1, Christian Bizer 1, and Christian Becker 2 1. Web-based Systems Group, Freie Universität Berlin, Germany [email protected],

More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

How graph databases started the multi-model revolution

How graph databases started the multi-model revolution How graph databases started the multi-model revolution Luca Garulli Author and CEO @OrientDB QCon Sao Paulo - March 26, 2015 Welcome to Big Data 90% of the data in the world today has been created in the

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

Social Media Mining. Graph Essentials

Social Media Mining. Graph Essentials Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures

More information

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction

More information

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition Liferay Portal Performance Benchmark Study of Liferay Portal Enterprise Edition Table of Contents Executive Summary... 3 Test Scenarios... 4 Benchmark Configuration and Methodology... 5 Environment Configuration...

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

More information

STINGER: High Performance Data Structure for Streaming Graphs

STINGER: High Performance Data Structure for Streaming Graphs STINGER: High Performance Data Structure for Streaming Graphs David Ediger Rob McColl Jason Riedy David A. Bader Georgia Institute of Technology Atlanta, GA, USA Abstract The current research focus on

More information
