Graph Databases Josep Lluis Larriba Pey David Dominguez Sal
1 Graph Databases Josep Lluis Larriba Pey David Dominguez Sal Norbert Martinez Bazán
2 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Graph technology and management: Sparksee 5. Basic operations: Community search 6. Parallel and distributed graph management 7. Applications: enriching queries 8. Benchmarking: LDBC 9. Conclusions
3 DAMA-UPC: university research group. Our goal: research and technology transfer on managing very large data volumes. The beginnings: since 1999, founded inside UPC. The team: 14 researchers and developers. Our software: DEX, massive graph management. Awards: IBM Faculty Awards (2004, 2009), IBM PhD Award (2004), CINC prize for novel entrepreneurs (2009). Publications: > 70 research papers, 4 patents. Funding, agreements and collaborators.
4 The DAMA-UPC/Sparsity ecosystem. DAMA-UPC research: graph management, community detection, parallel graph management, keyword enrichment. European projects: LDBC, CoherentPaaS, Tetracom. Sparsity Technologies: applications (social analytics: Tweeticer; social media; new mobile apps) and technology development (graph management: DEX, the original GDB). DAMA-UPC collaboration with Sparsity: support for research, support to apps, technology transfer. Why Sparsity? A good vehicle to: put our technologies in the market; create synergies with other companies; allow flexibility with DAMA-UPC (EC projects, hiring of talent). What does Sparsity commercialise? Sparksee 5.1 (Linux, Windows, MacOS; C++, Java, Python, .Net); Sparksee 5.1 mobile (Android, iOS, BB10); Tweeticer; Daurum. How is technology transfer organised? Gather the needs from companies; research; marketing and sales; customer support. Collaborations with: IBM, Oracle, CA Technologies, Media Planning, BMAT, etc. Put our technology in the market.
6 Graphs are everywhere!
7 Graphs are everywhere! Interest in the structural analysis of entity relationships is growing dramatically in many scenarios. The number of applications calling for efficient large graph management is increasing every day. Handling and querying such large graphs efficiently becomes essential. Graph mining techniques are emerging for analyzing graph-like data and extracting information.
8 Motivating Queries: Bibliographic Exploration. Authority discovery: Who receives more citations from reputed authors? Who is more central to a topic? Social network of an author. Determination of the life of a topic: How did a topic evolve backward and forward (citations)? Reviewer recommendation: Who knows a topic better? Who is authoritative? Who did some authors collaborate with?
9 Motivating Queries: Social Network Analysis. Determination of an actor's centrality in a network: degree centrality, closeness centrality ((normalized) inverse of the average distance metric), betweenness centrality. Determination of the role of an actor in a network: actors playing a particular social role have to be equivalent/similar to each other; structural equivalence (based on neighborhood). Differences with graph mining: graphs are usually smaller; power laws are not observed; efficient algorithms are not an issue.
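The degree and closeness measures named on this slide can be sketched in a few lines. This is a minimal illustration, assuming an undirected graph stored as a dictionary of neighbor sets; the function names and the three-vertex example graph are hypothetical and do not come from the slides.

```python
from collections import deque

def degree_centrality(adj, v):
    """Degree of v divided by the maximum possible degree (n - 1)."""
    return len(adj[v]) / (len(adj) - 1)

def closeness_centrality(adj, v):
    """Inverse of the average shortest-path distance from v to the vertices it reaches."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # plain BFS, unit edge weights
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical undirected path graph: a - b - c
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(degree_centrality(adj, "b"), closeness_centrality(adj, "a"))
```

The middle vertex `b` gets the highest value for both measures, which matches the intuition that it is the most central actor of the path.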
10 Motivating queries (cont'd): Root cause analysis. Distributed computing environment. Determination of the cause of errors. Pattern discovery. Spread of errors. Prediction of malfunctioning. Alerts of possible malfunctioning.
11 Small, large and huge graphs. Tweets: 177M tweets/day (04/11) x 365 > 64B tweets/year; 500M tweets/day (10/12) x 365 > 180B tweets/year. Papers: 20K average papers/day x 365 > 7M/year (Scopus). Internet users: from 1995 to 2010 to 2013, a growth of more than 100 times in the number of internet users, from 16M to > 2B to > 2.8B: 29% of the world's population in 2010 and 39% in 2013. B users send >400B s/day 3B phone calls/day in the USA as
12 Graphs are everywhere! Two models for graph processing. (1) Transactional-like environments, where queries are oriented to "who is...", "social network of...": restrictions on the data; Google-like queries, person-oriented queries. Problems: privacy, real-time aggregation of answers from different sources. (2) Graph analytical processing environments, where queries are oriented to provide big structural answers: queries on the whole graph, like "what communities are there in the graph?". Problems: how to distribute the graph, parallel implementations.
13 What do we need to represent? Labeled: nodes and edges are typed. Nodes: authors, papers and keywords. Edges: relations and citations. Directed: edges can have a fixed direction. Relations are bidirectional, citations unidirectional. Attributed: nodes and edges can have multiple single-valued attributes. Author nodes have a profile. Citation edges have a date. Multigraph: two nodes can be connected by multiple edges. An author may be related to a department at different moments.
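The four requirements on this slide (labeled, directed, attributed, multigraph) can be made concrete with a small in-memory structure. This is only a sketch under assumed names: `PropertyMultigraph`, `Node`, `Edge` and the `member_of` label are hypothetical and are not part of Sparksee or of the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    oid: int
    label: str                      # labeled: every node has a type
    attrs: dict = field(default_factory=dict)

@dataclass
class Edge:
    oid: int
    label: str                      # labeled: every edge has a type
    tail: int
    head: int
    directed: bool = True           # directed or bidirectional
    attrs: dict = field(default_factory=dict)

class PropertyMultigraph:
    def __init__(self):
        self.nodes, self.edges = {}, {}
        self._next_oid = 1

    def _new_oid(self):
        oid, self._next_oid = self._next_oid, self._next_oid + 1
        return oid

    def add_node(self, label, **attrs):
        oid = self._new_oid()
        self.nodes[oid] = Node(oid, label, attrs)
        return oid

    def add_edge(self, label, tail, head, directed=True, **attrs):
        oid = self._new_oid()       # edges have their own identity, so parallel edges are allowed
        self.edges[oid] = Edge(oid, label, tail, head, directed, attrs)
        return oid

    def edges_between(self, a, b):
        """Edges from a to b; undirected edges match in both orders."""
        return [e for e in self.edges.values()
                if (e.tail, e.head) == (a, b)
                or (not e.directed and (e.tail, e.head) == (b, a))]

g = PropertyMultigraph()
author = g.add_node("author", name="A. Author")
dept = g.add_node("department", name="Some Department")
p1, p2 = g.add_node("paper"), g.add_node("paper")
g.add_edge("member_of", author, dept, directed=False, since=2001)
g.add_edge("member_of", author, dept, directed=False, since=2010)  # second edge, same endpoints
g.add_edge("cites", p1, p2, date="2012-05-01")                      # unidirectional citation
```

Giving each edge its own identifier, instead of keying edges by their endpoints, is what lets the same author and department be connected twice.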
14 What do we want from graphs? What are the most common queries that we need to perform on graphs? What kind of processes must be carried out to perform these queries? Can we classify these queries? Is there a set of common underlying operations for all these queries?
15 Example: Usual Operations on Social Networks Common Social Network ops Neighbors Shortest Path Degree K-hops Connected Components Closeness Betweenness Bridging
18 Example: Usual Operations on Social Networks. Degree: out-degree. [Figure: example graph with each node annotated with its value]
19 Example: Usual Queries on Social Networks. Degree: in-degree. [Figure: example graph with each node annotated with its value]
20 Example: Usual Queries on Social Networks. K-hops: 1-hop.
21 Example: Usual Queries on Social Networks. K-hops: 2-hop.
22 Example: Usual Operations on Social Networks. K-hops: 3-hop.
23 Example: Usual Operations on Social Networks. K-hops: 4-hop.
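The k-hop operation illustrated on these slides can be sketched as a breadth-first search that stops expanding at depth k. The adjacency dictionary below is a hypothetical directed graph, not the one drawn on the slides.

```python
from collections import deque

def k_hop(adj, source, k):
    """Vertices whose shortest-path distance from source is exactly k."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:            # do not expand beyond depth k
            continue
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {v for v, d in dist.items() if d == k}

def out_degree(adj, v):
    return len(adj.get(v, ()))

# Hypothetical directed graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 4 -> 5
adj = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
for k in range(1, 5):
    print(k, sorted(k_hop(adj, 1, k)))
```

Each vertex is visited once, so the cost is linear in the part of the graph within k hops of the source.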
25 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions
26 The graph processing stack. Applications: "I would like to determine who could be a good reviewer for a paper." High level operations: "I need to compute, among others, the difference between two social graphs." Low level operations: "I need to iterate over nodes, get neighbours through edge navigation, etc."
27 The stack: tier 1, applications. Example: bibliography setting. Who can be a good reviewer for a paper? Analyse the submitted paper. Find authorities for the most important topics in the paper. Find the social network of the authors and eliminate them from the authorities.
28 The stack: tier 2, high level ops. Analyse the submitted paper: text analysis. Find authorities for the most important topics in the paper: PageRank-like algorithm if citations are included in the graph. Find the social network of the authors and eliminate them from the authorities: two-hop algorithm (1st hop my papers, 2nd hop my coauthors), followed by graph subtraction.
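The last step of this slide, a two-hop traversal followed by a subtraction, can be sketched with plain sets. The authorship data and the function names are hypothetical, and the set of authorities is simply given here rather than computed with a PageRank-like algorithm.

```python
def coauthors(papers_of, authors_of, author):
    """Two hops: author -> papers (1st hop) -> authors of those papers (2nd hop)."""
    found = set()
    for paper in papers_of.get(author, ()):
        found |= authors_of.get(paper, set())
    found.discard(author)
    return found

def candidate_reviewers(authorities, papers_of, authors_of, submitting_authors):
    """Authorities minus the submitting authors and their coauthors."""
    excluded = set(submitting_authors)
    for a in submitting_authors:
        excluded |= coauthors(papers_of, authors_of, a)
    return set(authorities) - excluded

# Hypothetical authorship graph stored as two adjacency maps
papers_of = {"ann": {"p1"}, "bob": {"p1", "p2"}, "cat": {"p2"}, "dan": {"p3"}}
authors_of = {"p1": {"ann", "bob"}, "p2": {"bob", "cat"}, "p3": {"dan"}}

reviewers = candidate_reviewers({"bob", "cat", "dan"}, papers_of, authors_of, ["ann"])
print(sorted(reviewers))
```

Here `bob` is dropped because he shares a paper with the submitting author, while `cat`, who is only a coauthor of a coauthor, stays in the candidate set.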
29 The stack: tier 3, low level ops. Basic analysis: get node/edge; get attributes from a node or an edge; get neighbors; node degree. Basic transformations: add/delete node/edge; add/delete/update attribute.
30 Summary of graph operations D. Dominguez-Sal, N. Martinez-Bazan, V. Muntés-Mulero, P. Baleta, and J-Ll. Larriba-Pey. A discussion on the design of graph database benchmarks. In Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems (TPCTC'10). Springer-Verlag, Berlin, Heidelberg,
31 Initial Conclusions. Graphs must be, in general, attributed, labeled (typed), directed multigraphs. Operations on graphs: many operations require access to the whole graph; significant set of cascaded operations; many queries access both the structure and the attributes. Candidate scenarios to benefit from this technology: any scenario with large graph datasets, a variety of operations, and industrial interest.
33 Sparksee, high performance technology Norbert Martínez-Bazán, Arnau Prat Pérez, David Dominguez Sal, and Josep Lluis Larriba Pey et al. DAMA-UPC/Sparsity Technologies N. Martínez-Bazan et al: Dex: high-performance exploration on large graphs for information retrieval. CIKM 2007: N. Martínez-Bazan et al: Efficient graph management based on bitmap indices. IDEAS 2012: R. Angles, A. Prat-Pérez, D. Dominguez-Sal, J.-L. Larriba Pey: Benchmarking database systems for social network applications. GRADES 2013: 15
34 Classical Graph Representation: Adjacency matrix. Good: static, good locality for dense graphs (not common), bit-based representation. Bad: prohibitively expensive size, slow neighbor retrieval, expensive vertex insertion.
35 Classical Graph Representation: Adjacency list. Good: compact size for sparse graphs, fast neighbor retrieval. Bad: slower test of whether an edge exists between two vertices, dynamic memory.
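The trade-off between the two classical representations can be seen by building both for the same edges. This is a minimal sketch with a hypothetical four-vertex directed graph.

```python
# Hypothetical directed graph with 4 vertices
n = 4
edges = [(0, 1), (0, 2), (2, 3)]

# Adjacency matrix: n * n cells, whatever the number of edges
matrix = [[False] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = True

# Adjacency list: one entry per edge
adj_list = [[] for _ in range(n)]
for u, v in edges:
    adj_list[u].append(v)

def neighbors_matrix(u):
    return [v for v in range(n) if matrix[u][v]]   # scans a full row of n cells

def neighbors_list(u):
    return adj_list[u]                             # proportional to the degree of u

def has_edge_matrix(u, v):
    return matrix[u][v]                            # single cell lookup

def has_edge_list(u, v):
    return v in adj_list[u]                        # scans the neighbors of u

matrix_cells = n * n
list_entries = sum(len(row) for row in adj_list)
print(matrix_cells, list_entries)
```

For this graph the matrix holds 16 cells for 3 edges, which is the size problem the previous slide points at, while the list pays for its compactness with a scan on every edge-existence test.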
36 Classical Graph Representation: Several limitations. No labels. No node and edge attributes. No multigraphs. Good for in-memory processing.
37 Relational database implementation. Store each type of node in a table with its associated attributes. Store each type of edge in a table with its associated attributes. Navigation through edges is resolved by join operations. Query language: SQL. Not suitable for path traversals and graph exploration. E.g.: DB2, Oracle, SQL Server, MySQL, Postgres.
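To see why navigation turns into joins, the sketch below stores one node type and one edge type in tables and answers a two-hop query. It uses Python's standard `sqlite3` module with an in-memory database; the `person` and `friend` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);   -- one table per node type
CREATE TABLE friend (src INTEGER, dst INTEGER);            -- one table per edge type
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat'), (4, 'Dan');
INSERT INTO friend VALUES (1, 2), (2, 3), (2, 4);
""")

def friends_of_friends(person_id):
    """Two-hop traversal: every hop adds one self-join on the edge table."""
    rows = conn.execute("""
        SELECT DISTINCT p.name
        FROM friend AS f1
        JOIN friend AS f2 ON f2.src = f1.dst
        JOIN person AS p  ON p.id = f2.dst
        WHERE f1.src = ?
        ORDER BY p.name
    """, (person_id,))
    return [name for (name,) in rows]

print(friends_of_friends(1))
```

A path of length k needs k joins written into the query, which is the reason the slide calls this model unsuitable for path traversals of unknown length.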
38 Resource Description Framework (RDF). Triple format: subject, predicate, object. All data is represented as triples; queries are patterns on such triples. Edge: a triple where subject and object are nodes. Attribute: a triple where the subject is a node and the object is a literal value. Query language: SPARQL. Not a natural representation. E.g.: Virtuoso, RDF-3X, Sesame; RDBMSs are also starting to support RDF (DB2, Oracle, ...).
39 New approaches. Graph databases: graph / edge types; specialized graph API or language (e.g. Cypher, Gremlin); out-of-core functionalities with buffer pool. E.g.: Sparksee, Neo4j, OrientDB, HyperGraphDB, InfiniteGraph. Distributed graph analysis for large datasets: Map-reduce (Pegasus); vertex-centric computation model (Pregel, Giraph, GraphLab, ...).
40 Requirements. Data and schema represented as a graph. Data operations based on graph operations. Graph-based integrity restrictions. Multigraphs. Attributes attached to both vertices and edges. Graph queries combining edge traversals with attribute accesses. Diversity of workloads. Efficient secondary memory management.
41 Sparksee: Main features. Graph split into small structures: move to main memory just the significant parts (caching). Object identifiers (oids) instead of complex objects: reduce memory requirements. Specific structures to improve traversals: index the edges and the neighbors of each node. Attribute indices: improve queries based on value filters. Implemented in C++; different APIs (Java, .NET, etc.) through wrappers.
42 Sparksee: Capabilities. Efficiency: very compact representation using bitmaps; highly compressible data structures. Capacity: more than 100 billion vertices and edges in a single multicore computer. Performance: subsecond response in recommendation queries. Scalability: high throughput for concurrent queries. Consistency: partial transactional support with recovery. Multiplatform: Linux, Windows, MacOSX, mobile.
43 Sparksee Architecture. [Figure: architecture diagram; surviving label: SparkseePython]
44 Graph representation. We define a graph $G = (V, E, L, T, H, A_1, \dots, A_p)$ as: LABELS $L = \{(o, l) \mid o \in (V \cup E) \wedge l \in \text{string}\}$; TAILS $T = \{(e, t) \mid e \in E \wedge t \in V\}$; HEADS $H = \{(e, h) \mid e \in E \wedge h \in V\}$; ATTRIBUTES $A_i = \{(o, c) \mid o \in (V \cup E) \wedge c \in \{\text{int}, \text{string}, \dots\}\}$. With this representation: the graph is split into multiple lists of pairs; the first element of each pair is always a vertex or an edge.
45 Graph representation (example)
L: (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS), (e6, CONTAINS), (e7, CONTAINS)
T: (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4)
H: (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6)
A_id: (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2)
A_title: (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona)
A_nlc: (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en)
A_filename: (v5, europe.png), (v6, bcn.jpg)
A_tag: (e4, continent)
46 Value sets. A value set groups all pairs of the original set that share a value into a single pair made of the value and the set of objects with that value. Original set => value set:
L: (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS), (e6, CONTAINS), (e7, CONTAINS) => (ARTICLE, {v1, v2, v3, v4}), (BABEL, {e1, e2}), (CONTAINS, {e5, e6, e7}), (IMAGE, {v5, v6}), (REF, {e3, e4})
T: (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4) => (v1, {e1}), (v2, {e2}), (v3, {e5, e6}), (v4, {e3, e4, e7})
H: (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6) => (v3, {e1, e2, e3, e4}), (v5, {e5}), (v6, {e6, e7})
A_id: (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2) => (1, {v1, v5}), (2, {v2, v6}), (3, {v3}), (4, {v4})
A_title: (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona) => (Barcelona, {v4}), (Europa, {v1}), (Europe, {v2, v3})
A_nlc: (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en) => (ca, {v1}), (en, {v3, v4, e1, e2}), (fr, {v2})
A_filename: (v5, europe.png), (v6, bcn.jpg) => (bcn.jpg, {v6}), (europe.png, {v5})
A_tag: (e4, continent) => (continent, {e4})
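The grouping that turns a list of pairs into a value set can be sketched as follows. The function name `value_set` is hypothetical; the input pairs are the L and T lists of the example.

```python
from collections import defaultdict

def value_set(pairs):
    """Group (object, value) pairs into {value: set of objects with that value}."""
    grouped = defaultdict(set)
    for obj, value in pairs:
        grouped[value].add(obj)
    return dict(grouped)

labels = [("v1", "ARTICLE"), ("v2", "ARTICLE"), ("v3", "ARTICLE"), ("v4", "ARTICLE"),
          ("v5", "IMAGE"), ("v6", "IMAGE"), ("e1", "BABEL"), ("e2", "BABEL"),
          ("e3", "REF"), ("e4", "REF"), ("e5", "CONTAINS"), ("e6", "CONTAINS"),
          ("e7", "CONTAINS")]
tails = [("e1", "v1"), ("e2", "v2"), ("e3", "v4"), ("e4", "v4"),
         ("e5", "v3"), ("e6", "v3"), ("e7", "v4")]

label_sets = value_set(labels)
tail_sets = value_set(tails)
print(sorted(label_sets["ARTICLE"]), sorted(tail_sets["v4"]))
```

After grouping, the tail value set doubles as an index of outgoing edges: `tail_sets["v4"]` is the set of edges that leave v4.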
47 Bitmap Index. Each vertex and edge is identified by a unique and immutable oid (object identifier). Each vertex or edge set is stored in a bitmap structure: each position in a bitmap corresponds to the oid of an object; reduced amount of space (compression techniques); very efficient binary logic operations. E.g.: (ARTICLE, {v1, v2, v3, v4}) -> Article:
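A bitmap with one position per oid can be imitated with a Python integer. This is a sketch only: the oid numbering (v1..v6 as 1..6, e1..e7 as 7..13) is an assumption made for the illustration, and no compression is applied.

```python
def to_bitmap(oids):
    """Set bit i for every oid i in the collection."""
    bits = 0
    for oid in oids:
        bits |= 1 << oid
    return bits

def from_bitmap(bits):
    """List the oids whose bit is set, in increasing order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

def count(bits):
    return bin(bits).count("1")

# Assumed numbering: v1..v6 -> 1..6, e1..e7 -> 7..13
article = to_bitmap([1, 2, 3, 4])        # (ARTICLE, {v1, v2, v3, v4})
english = to_bitmap([3, 4, 7, 8])        # (en, {v3, v4, e1, e2})

english_articles = article & english     # set intersection is a single bitwise AND
print(bin(article), from_bitmap(english_articles), count(article))
```

Intersections, unions and differences of object sets become `&`, `|` and `& ~`, which is what the slide means by efficient binary logic operations.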
48 Link: the basic internal structure. New way to represent a graph that allows for out-of-core management: the graph is split into smaller structures to favor the caching of significant parts; object identifiers (oids) are used to reduce memory requirements; specific structures help in the navigation and traversal of edges; attributes are fully indexed to allow queries based on filters. [Figure: a Link, built from Maps (between values and oids) and Bitmaps]
49 Example of a bitmap based representation
50 Integrity rules
51 Value set operations. Domain: returns the set of distinct values. Objects: returns the set of vertices or edges associated to a value. Lookup: returns the set of values associated to a set of objects. Insert/Remove: adds a vertex or edge to the collection, or removes it.
52 Graph query examples. Number of articles: $|objects(LABELS, \text{'ARTICLE'})|$. Out-degree of English article 'Europe': $|objects(TAILS, objects(TITLE, \text{'Europe'}) \cap objects(NLC, \text{'en'}) \cap objects(LABELS, \text{'ARTICLE'}))|$. Articles with references to the image with filename 'bcn.jpg': $\{lookup(TAILS, x) \mid x \in objects(HEADS, objects(FILENAME, \text{'bcn.jpg'}) \cap objects(LABELS, \text{'IMAGE'}))\}$. Count the articles of each language: $\{(x, y) \mid x \in domain(NLC) \wedge y = |objects(NLC, x) \cap objects(LABELS, \text{'ARTICLE'})|\}$.
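The four example queries can be run against the pairs of slide 45 with a small stand-in for the value set operations. The `ValueSet` class below is a hypothetical, unindexed sketch (every call scans the pair list), not the Sparksee API; in particular, `objects` accepting either one value or a set of values is a choice made for this illustration.

```python
class ValueSet:
    """(object, value) pairs with the domain, objects and lookup operations."""

    def __init__(self, pairs):
        self.pairs = list(pairs)

    def domain(self):
        return {value for _, value in self.pairs}

    def objects(self, values):
        wanted = values if isinstance(values, set) else {values}
        return {obj for obj, value in self.pairs if value in wanted}

    def lookup(self, objs):
        wanted = objs if isinstance(objs, set) else {objs}
        return {value for obj, value in self.pairs if obj in wanted}

LABELS = ValueSet([("v1", "ARTICLE"), ("v2", "ARTICLE"), ("v3", "ARTICLE"),
                   ("v4", "ARTICLE"), ("v5", "IMAGE"), ("v6", "IMAGE"),
                   ("e1", "BABEL"), ("e2", "BABEL"), ("e3", "REF"), ("e4", "REF"),
                   ("e5", "CONTAINS"), ("e6", "CONTAINS"), ("e7", "CONTAINS")])
TAILS = ValueSet([("e1", "v1"), ("e2", "v2"), ("e3", "v4"), ("e4", "v4"),
                  ("e5", "v3"), ("e6", "v3"), ("e7", "v4")])
HEADS = ValueSet([("e1", "v3"), ("e2", "v3"), ("e3", "v3"), ("e4", "v3"),
                  ("e5", "v5"), ("e6", "v6"), ("e7", "v6")])
TITLE = ValueSet([("v1", "Europa"), ("v2", "Europe"), ("v3", "Europe"), ("v4", "Barcelona")])
NLC = ValueSet([("v1", "ca"), ("v2", "fr"), ("v3", "en"), ("v4", "en"),
                ("e1", "en"), ("e2", "en")])
FILENAME = ValueSet([("v5", "europe.png"), ("v6", "bcn.jpg")])

articles = LABELS.objects("ARTICLE")
number_of_articles = len(articles)

# Out-degree of the English article titled 'Europe'
europe_en = TITLE.objects("Europe") & NLC.objects("en") & articles
out_degree = len(TAILS.objects(europe_en))

# Articles with an edge pointing to the image stored as bcn.jpg
image = FILENAME.objects("bcn.jpg") & LABELS.objects("IMAGE")
referencing = TAILS.lookup(HEADS.objects(image))

# Number of articles per language
per_language = {lang: len(NLC.objects(lang) & articles) for lang in NLC.domain()}
```

Every query is a composition of the same three primitives plus set intersection, which is why fast set operations on object collections matter so much for this design.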
53 Implementation details. Bitmaps are compressed by grouping the bits into clusters of 32 consecutive bits (up to 137 billion objects per graph). Locality is improved by generating consecutive oids for each distinct vertex or edge label. A sorted tree structure of bitmap clusters speeds up the insert, remove, and binary logic operations. Maps are implemented using B+ trees. The tail, head and attribute value sets have been split into specific value sets for each label.
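The idea of keeping only the non-empty 32-bit clusters of a bitmap can be sketched as below. Two caveats: a Python dictionary stands in for the sorted tree of clusters mentioned on the slide, and the class name `ClusteredBitmap` is hypothetical. The last lines check the capacity figure against the 37-bit identifiers mentioned in the additional material at the end of the deck.

```python
CLUSTER_BITS = 32

class ClusteredBitmap:
    def __init__(self):
        self.clusters = {}          # cluster index -> 32-bit word; empty clusters are not stored

    def add(self, oid):
        idx, bit = divmod(oid, CLUSTER_BITS)
        self.clusters[idx] = self.clusters.get(idx, 0) | (1 << bit)

    def remove(self, oid):
        idx, bit = divmod(oid, CLUSTER_BITS)
        word = self.clusters.get(idx, 0) & ~(1 << bit)
        if word:
            self.clusters[idx] = word
        else:
            self.clusters.pop(idx, None)   # drop the cluster once it becomes empty

    def __contains__(self, oid):
        idx, bit = divmod(oid, CLUSTER_BITS)
        return bool((self.clusters.get(idx, 0) >> bit) & 1)

    def intersect(self, other):
        result = ClusteredBitmap()
        for idx, word in self.clusters.items():
            both = word & other.clusters.get(idx, 0)
            if both:
                result.clusters[idx] = both
        return result

a = ClusteredBitmap()
for oid in (1, 2, 1_000_000):
    a.add(oid)
b = ClusteredBitmap()
for oid in (2, 3):
    b.add(oid)
common = a.intersect(b)

# 37-bit identifiers address 2**37 objects, i.e. a little over 137 billion
capacity = 2 ** 37
print(len(a.clusters), capacity)
```

Three oids spread over a range of a million positions cost two stored words here instead of more than thirty thousand, which shows why sparse sets compress well under this scheme.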
54 Evaluation (GRADES'13). Objective: understand how small queries stress different DBs: graph, RDF, relational. Data schema based on social network data: Person (ID, name, location, age) and WebPage (ID, URL, creation) nodes; Friend (Person to Person) and Like (Person to WebPage) edges. Graph data generator based on R-MAT. In collaboration with R. Angles, U. Talca.
55 Query set. Selection: (Q1) Get all the persons having a name N; (Q4) Get the name of the person with a given PID. Adjacency: (Q2) Get all the persons who like a given webpage W; (Q3) Get the webpages that person P likes. Pattern matching: (Q10) Get the common friends between persons P1 and P2; (Q11) Get the common webpages that persons P1 and P2 like. Fixed-length path: (Q5) Get the friends of the friends of a given person P; (Q6) Get the webpages liked by the friends of a given person P; (Q7) Get persons that like a webpage which a person P likes. Reachability: (Q8) Is there a friend connection between two persons?; (Q9) Get the shortest path between two persons. Summarization: (Q12) Get the number of friends of a person P.
56 Experiments: systems
DB System | DB Type | Implementation | Query language
Dex (v4.7) | Graph | API | -
Neo4j (v1.8.2) | Graph | API | Cypher
RDF-3X (v0.3.7) | RDF | Java driver | SPARQL
Virtuoso (v7.0) | RDF / Column store | Java driver, stored procedures | SQL + extension, Virtuoso/PL
PostgreSQL (v9.1) | Row based | Java driver, stored procedures | SQL, PL/PgSQL
Test environment. Hardware: Intel Xeon E GHz, 32 GB RAM, 1TB HD. Software: Linux Debian amd64 kernel, ext3 file system.
57 Experiments: query execution test
59 On Demand Memory Specialization for Distributed Graph Databases. Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep-Lluis Larriba-Pey. Reference: Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep-Lluis Larriba-Pey: On Demand Memory Specialization for Distributed Graph Databases. CoRR abs/ (2013)
60 Motivation. Large graphs: need to be stored and processed in parallel/distributed computers; difficult to partition. Partition functions: favour specific operations; have to be executed for different environments; are costly to execute.
61 Contributions. System design in two levels: physical storage and memory management. Data access pattern monitoring. Specific data structure. Load and network balancing. Increased throughput.
62 System Overview Memory management Storage
63 Partition Manager. We propose a new data structure. Monitors data access patterns. Uses this information in a simple way to decide how to route queries. Matrix of data access sequences. New compressed data structure.
65 Experiments. Scalability with cluster size: tested up to 32 machines. Systems compared: static partitioning; dynamic partitioning (ours). R-MAT graph: 37M vertices, 1B edges. Queries: BFS and k-hops.
66 Experiments: BFS. Throughput (higher is better). Load imbalance (lower is better).
67 Experiments
68 Experiments. Average response time (lower is better).
70 Shaping Communities out of Triangles Arnau Prat Pérez, David Dominguez Sal, Josep Maria Brunat and Josep Lluis Larriba Pey DAMA-UPC A. Prat-Pérez, D. Dominguez-Sal, J.L. Larriba-Pey: High quality, scalable and parallel community detection for large real graphs. WWW 2014: A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, J.L. Larriba-Pey: Shaping communities out of triangles. CIKM 2012:
71 Community detection. Community, informal definition: a set of nodes that are well connected among themselves, but not with other nodes. There is no agreement on a definition.
72 State of the art: Community Detection. In general, state of the art metrics perform well: modularity, conductance. However, under certain circumstances, they fail: treelike structures, clique chains. Algorithms based on those metrics fail to provide quality and performance. Reason: they focus on edges but ignore the internal structure.
73 Our goal. Find algorithms with a strong focus on: quality, scalability, parallelism. First, propose a metric: Weighted Community Clustering (WCC). Then, propose an algorithm: Scalable Community Detection (SCD). First proposal for a scalable disjoint community detection algorithm for SMP architectures. Undirected graph without attributes.
74 Weighted Community Clustering We take triangles as the key factor of a community structure.
75 Weighted Community Clustering. t(x, S) is the number of triangles that vertex x closes with vertices in a set S. vt(x, S) is the number of vertices of a set S that form at least one triangle with x. The WCC of a community S is the average WCC of its vertices.
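The per-vertex formula did not survive in this transcript. The sketch below uses the definition as recalled from the cited CIKM 2012 paper, and it should be checked against that paper before being relied on: WCC(x, S) = t(x, S) / t(x, V) * vt(x, V) / (|S \ {x}| + vt(x, V \ S)) when t(x, V) > 0, and 0 otherwise. One further assumption made here: in vt(x, S), the third vertex of the triangle may lie anywhere in the graph. The example graph is hypothetical.

```python
from itertools import combinations

def t(adj, x, S):
    """t(x, S): triangles that x closes with two vertices of S."""
    nbrs = [u for u in adj[x] if u in S and u != x]
    return sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])

def vt(adj, x, S):
    """vt(x, S): vertices of S in at least one triangle with x (third vertex anywhere: an assumption)."""
    return sum(1 for u in adj[x] if u in S and u != x and adj[x] & adj[u])

def wcc_vertex(adj, x, S):
    V = set(adj)
    t_all = t(adj, x, V)
    if t_all == 0:
        return 0.0                  # vertices closing no triangle contribute nothing
    denom = len(S - {x}) + vt(adj, x, V - S)
    return t(adj, x, S) / t_all * vt(adj, x, V) / denom

def wcc(adj, S):
    """WCC of a community S: average WCC of its vertices."""
    return sum(wcc_vertex(adj, x, S) for x in S) / len(S)

# Hypothetical undirected graph: triangle 0-1-2 plus a pendant vertex 3 attached to 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
tight = wcc(adj, {0, 1, 2})
loose = wcc(adj, {0, 1, 2, 3})
print(tight, loose)
```

Adding the pendant vertex to the triangle lowers the score, which illustrates the stated focus on triangles rather than on edges alone.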
76 Properties. Maximizing WCC fulfils a minimum set of properties: internal structure (triangles); linear community cohesion (number of connections of a node); bridges (no bridges guaranteed); cut vertex density (minimum density guaranteed).
77 Algorithm overview
78 Experimental setup. Real graphs with ground truth communities. Average F1Score and NMI to measure quality. Baseline with disjoint and overlapping algorithms: Walktrap, Infomap, Louvain, Bigclam and Oslom. Intel Xeon Quad Core 2.4 GHz with 32 GB of RAM.
Graph | Nodes | Edges
Amazon | 0.3 M | 0.9 M
dblp | 0.3 M | 1.0 M
Youtube | 1.1 M | 2.9 M
Livejournal | 3.9 M | 34.6 M
Orkut | 3 M | M
Friendster | 65 M | 1806 M
79 Execution Time All executions are single threaded. Executions longer than one week were aborted.
80 F1Score
81 Complexity. Friendster graph in 4.3 hours! Intel Xeon Quad Core 2.4 GHz with 32 GB of RAM.
82 Parallelism. Parallelization with OpenMP.
84 Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey: Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval. ICMR 2014: 33
85 Introduction. Quality depends on the user's skills. Short keyword-based queries. The set of keywords is not exact, correct and complete.
86 Query Expansion. Process of transforming $Q_O$ into $Q_E$. Detect expansion features (terms/phrases). What kind of expansion features? How to obtain the expansion features?
87 Overview
88 Path detection θ=colored Volkswagen beetles κ=volkswagen beetles in any color for example, red, blue, green or yellow. Volkswagen volkswagen beetle volkswagen fox volkswagen beetle volkswagen passat volkswagen beetle volkswagen type 2 volkswagen beetle volkswagen golf volkswagen beetle volkswagen jetta volkswagen beetle volkswagen touareg volkswagen beetle volkswagen golf mk4 volkswagen beetle volkswagen beetle volkswagen transporter
89 Build communities Build communities around paths Use paths as seeds Communities retrieve the closest related concepts to the path May not be directly connected to the concept
90 Example. Query: Colored volkswagen beetles. From topological expansion: Volkswagen beetle, Volkswagen fusca, VW type 1, Volkswagen 1200, Volkswagen bug, Volkswagen super bug, VW bug, VW Käfer, Volkswagen new Beetle, Baja bug, Volkswagen group, VW group, H-Shifter, engine, car, automobile, ... From redirect-based expansion: VW Beetles, VW Beetle, VW Käfer, Volkswagon Beetles, Volkswagon Beetle, Volkswagon Käfer, Volkswagen Beetles, Volkswagen Beetle, Volkswagen Käfer. Hundreds of expansion terms.
91 Results for the query "Colored volkswagen beetles": enrichment vs. no enrichment. [Figure: retrieved images]
93 LDBC Social Network Benchmark
94 Why benchmarking? Two main objectives: allow final users to assess the performance of the software they want to buy; push the technology to the limit to allow for progress. Main effort in DB benchmarking up to now: TPC (Transaction Processing Performance Council), for relational DBs: transactional and DSS.
95 What is LDBC? Non-profit organization formed in 2012 from a European Union project. Defines and develops benchmarks for graph-like data management technologies: graph database management systems (GDBMS), RDF systems, graph processing frameworks. LDBC provides: all the documentation (workload definitions, running instructions, disclosure guidelines) of the benchmarks it develops; all the necessary data and software to run the benchmarks.
96 What is LDBC? With the participation of the principal actors in graph data management and RDF:
97 Objectives: benchmarks for the emerging field of RDF and graph database management systems (GDBs); spur industry cooperation around benchmarks. LDBC foundation created and operative on 2Q. Benchmarks created: Semantic Publishing Benchmark (SPB), Social Network Benchmark (SNB). Web site: Software repositories in GitHub
98 What is LDBC? Currently developing two benchmarks. Social Network Benchmark (LDBC-SNB): for testing the performance of graph databases and graph processing frameworks, inspired by the management of a social network. Semantic Publishing Benchmark (LDBC-SPB): for testing the performance of RDF engines, inspired by the media/publishing industry. For more information, visit:
99 The Social Network Benchmark LDBC-SNB[1]. Philosophy: rich coverage, modularity, small implementation cost, relevance, reproducibility. Open source: you may do whatever you want with it! Mimics the operation of a real social network: simple to understand; allows the testing of a complete range of interesting challenges; can be easily scaled. [1]
100 The SNB workloads. Interactive: interactive queries representing the interaction of the users with the social network; low latency and multiple concurrent users; small data accessed; 14 read and 8 update queries. Business Intelligence: complex structured queries for analyzing the online behavior of users for marketing purposes; non-interactive queries; moderate data accessed; 22 queries. Graph Analytics: expensive graph analytical queries (PageRank, centrality, clustering, ...); large data accessed; not defined yet, being designed.
101 LDBC-SNB
102 DATA SCHEMA
103 INTERACTIVE. Consists of 14 read queries and 8 update queries. Read queries, high level description. Read Query 1: Friends with a certain name. Read Query 2: Recent posts and comments by your Friends. Read Query 3: Friends and Friends of Friends that have been in countries X and Y. Read Query 4: New topics. Read Query 5: New groups. Read Query 6: Tag co-occurrence. Read Query 7: Recent likes.
104 INTERACTIVE. Read queries, high level description (cont'd). Read Query 8: Recent replies. Read Query 9: Recent posts and comments by Friends and Friends of your Friends. Read Query 10: Friend recommendation. Read Query 11: Job referral. Read Query 12: Expert search. Read Query 13: Single shortest path. Read Query 14: Weighted paths.
105 INTERACTIVE. Update queries, high level description. Update Query 1: Add Person. Update Query 2: Add Friendship. Update Query 3: Add Forum. Update Query 4: Add Forum Membership. Update Query 5: Add Post. Update Query 6: Add Like to Post. Update Query 7: Add Comment. Update Query 8: Add Like to Comment.
106 BUSINESS INTELLIGENCE. Consists of 22 read queries. High level description. Query 1: Post stats. Query 2: Volume in forums on a subject in a topic. Query 3: The locally going thing. Query 4: Thread length distribution. Query 5: Best publicist. Query 6: Branding hour. Query 7: Market share. Query 8: Cross-border conversation.
107 BUSINESS INTELLIGENCE. High level description (cont'd). Query 9: Person with most posts in foreign language. Query 10: People who studied in the same university. Query 11: Teenagers talking to strangers. Query 12: Liker clique. Query 13: Data quality. Query 14: Find unusual values. Query 15: Development of mindshare. Query 16: Biggest posters on a tag.
108 BUSINESS INTELLIGENCE. High level description (cont'd). Query 17: Country bias. Query 18: Posting activity. Query 19: Hub Countries. Query 20: Moving predicates. Query 21: Mole hunt. Query 22: Concept promoter.
109 DATAGEN. Datasets are synthetically generated using DATAGEN[1]: realistic, scalable, deterministic, usable. Datasets simulate a social network's activity during a period of time. Uses Hadoop (MapReduce) to scale to millions of entities. [1]
110 DATAGEN. 16 dictionaries extracted from DBpedia are used to produce correlated attributes, e.g. names by country, tags by country, companies by country, etc. Reproduces realistic distributions found in real social networks, e.g. the friendship degree distribution mimics that found in Facebook. Reproduces the homophily principle, i.e. similar people tend to be connected. Generates substitution parameters for the queries of the workloads. Implemented with Hadoop to generate huge datasets.
111 SCALE FACTORS We define a set of scale factors: SF1, SF3, SF10, SF1000 Different scale factors target systems of different sizes and characteristics From single node machines to large clusters. Each scale factor represents a data set of a different size, in gigabytes: SF Persons Activity Size SF years 1GB SF years 10GB SF years 100GB
112 DATAGEN degree distribution. [Figure: friendship degree distribution, Facebook [1] vs. Datagen SF1] [1] -data-team/anatomy-of-
113 LDBC Execution Driver. We provide the SNB execution driver[1]: easy to extend; small implementation cost; multiple concurrent threads. Each query type is assigned an interleave interval: the time between the issue of two consecutive query instances of the same type. Queries vary in complexity, so we must ensure that no query type dominates the execution time. Interleave intervals are extracted experimentally. Tracks dependencies between update queries automatically: e.g., we cannot insert a post before its author is inserted. Automatically gathers and reports performance metrics. [1]
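The interleave interval can be illustrated with a tiny scheduler that merges the issue times of all query types into one timeline. This is a sketch of the idea only, not the LDBC driver: the function name, the query names and the interval values are hypothetical, and all intervals are assumed to be positive.

```python
import heapq

def schedule(interleave_ms, horizon_ms):
    """Issue one instance of each query type every interleave interval, up to the horizon."""
    heap = [(interval, name) for name, interval in interleave_ms.items()]
    heapq.heapify(heap)
    timeline = []
    while heap and heap[0][0] <= horizon_ms:
        due, name = heapq.heappop(heap)                 # next query instance to issue
        timeline.append((due, name))
        heapq.heappush(heap, (due + interleave_ms[name], name))
    return timeline

# Hypothetical intervals: the cheaper query Q1 is issued more often than Q2
timeline = schedule({"Q1": 30, "Q2": 50}, horizon_ms=100)
print(timeline)
```

Giving an expensive query type a longer interval reduces how often it is issued, which is one way to keep any single type from dominating the execution time.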
114 Conclusions. LDBC defines and develops benchmarks for graph-like data management: relevant, easy to adopt. Two benchmarks: SNB, with three different workloads for different types of graph management systems; SPB, for RDF engines. LDBC-SNB provides: a realistic social network data generator; an execution driver. All software is open source.
115 Outline 1. About us 2. Introduction to graphs 3. The graph processing stack 4. Low level graph operations Sparksee: Graph technology and management 5. High level graph operations. Parallel and distributed graph management 6. Domain specific high level operations. Community search 7. Domain specific operations. Query enrichment 8. Benchmarking: LDBC 9. Conclusions
116 Conclusions Graph management is becoming important: different applications where traversals, pattern matching and complex graph operations are needed. Performance is an issue. Building high performance systems: Sparksee, a vertical store based on bitmap processing; distributed management of graphs. Community search: important to take structure into account. LDBC provides: benchmarks relevant for industry; two flavours of benchmark, SPB and SNB; open to everybody to use, explore, and propose new ideas in benchmarking.
117 Any questions? Contact DAMA Group Web Site:
118 Sparksee evaluation (IDEAS) Additional material
119 Commercial Graph DBMSs
120 DEX: Some internal details All the structures presented have been implemented in the current version of the DEX core engine. Support for 37-bit unsigned integer nids and eids: more than 137 billion objects per graph. Identifiers are clustered in groups for each node or edge type. Bitmaps are compressed by grouping the bits in clusters of 32 consecutive bits; only clusters with at least one bit set are stored. Maps are implemented using B+ trees. N. Martínez-Bazán, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M. Sánchez-Martínez, and J.Ll. Larriba-Pey. DEX: High Performance Exploration on Large Graphs for Information Retrieval. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM), Lisbon, pages , N. Martínez-Bazán, S. Gómez-Villamor, V. Muntés-Mulero and J.Ll. Larriba-Pey. Procedure to represent and manipulate multigraphs based on the use of bitmaps. Spanish Patent and Trademark Office, Patent Pending #40444, 22/July/2008. N. Martínez-Bazán. DEX: High-Performance Graph Databases. Master Thesis in Computer Architecture and Network Systems, Universitat Politècnica de Catalunya, Barcelona, September/2008.
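The compression rule above (store a 32-bit cluster only if at least one of its bits is set) can be illustrated with a small sketch. This is not the DEX data structure: the class name `SparseBitmap` and the use of a Python dict in place of a B+ tree map are assumptions. For reference, a 37-bit identifier space holds 2^37 = 137,438,953,472 values, which matches the "more than 137 billion objects" figure.

```python
WORD_BITS = 32  # cluster size stated on the slide


class SparseBitmap:
    """Bitmap that keeps only the 32-bit clusters containing at least one set bit."""

    def __init__(self):
        # cluster index -> 32-bit word; all-zero clusters are never stored.
        # A dict stands in here for the B+ tree map mentioned on the slide.
        self.words = {}

    def set(self, oid):
        idx, bit = divmod(oid, WORD_BITS)
        self.words[idx] = self.words.get(idx, 0) | (1 << bit)

    def clear(self, oid):
        idx, bit = divmod(oid, WORD_BITS)
        word = self.words.get(idx, 0) & ~(1 << bit)
        if word:
            self.words[idx] = word
        else:
            self.words.pop(idx, None)  # drop the cluster once it is empty

    def test(self, oid):
        idx, bit = divmod(oid, WORD_BITS)
        return bool((self.words.get(idx, 0) >> bit) & 1)

    def stored_clusters(self):
        return len(self.words)
```

Setting two identifiers at opposite ends of the 37-bit range stores only two clusters, however large the gap between them.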
121 Performance and Memory Usage Benchmark: Wikipedia from 254 different languages with 57 million articles, 2.1 million images, more than 321 million links, and 483 million attribute values. (17 GB)
Table columns: Query, Time (s), Traversals, Bitmaps (MB), Mem. usage (MB)
Q1 large SPT (BFS): find the article with the largest outdegree and find a shortest-path tree (SPT) only considering the edges of type REF
Q2 top-k 2-hop: recommend related articles to the most popular one
Q3 pattern matching: find new images for articles in other languages
Q4 group by & count: find, for each different language, the number of articles and the number of images referenced by those articles without repetition
Q5 update 2.1M: for each article, materialize an attribute indicating the number of images contained, only if it contains more than one image
Q6 delete ~90%: remove all the articles without any image
Experiments performed using a computer with two quadcore Intel(R) Xeon(R) E5440 at 2.83 GHz. The memory hierarchy is 6144 KB second level cache, 64 GB main memory, and a disk with 1.7 TB. The operating system is Linux Debian etch 4.0
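As an illustration of what a Q1-style query computes, the sketch below finds the node with the largest outdegree and builds a BFS shortest-path tree from it. It uses a plain adjacency dictionary rather than the DEX API; the function names are hypothetical, and the graph is assumed to contain only the edges of the type being traversed (REF in the benchmark).

```python
from collections import deque


def largest_outdegree(adj):
    """Return the node with the most outgoing edges."""
    return max(adj, key=lambda node: len(adj[node]))


def bfs_spt(adj, root):
    """Return a shortest-path tree as a child -> parent map (root maps to None)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in parent:  # first visit is along a shortest path
                parent[nxt] = node
                queue.append(nxt)
    return parent


# Toy graph: article -> articles it references.
adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
root = largest_outdegree(adj)
tree = bfs_spt(adj, root)
```

On the toy graph, `root` is `"A"` and `tree` records that `"D"` is reached through `"B"`, the first of its two equally short routes to be explored.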
122 Analysis of the Distribution of Bitmaps 99.97% of the bitmaps are smaller than 250 bytes and occupy only 32% of data
123 Analysis of Bitmap Usage
124 Comparison with Other Approaches [Table: columns In-memory, MonetDB, Neo4j, DEX; rows Data (GB), Load time (hours), Q1 large SPT (BFS), Q2 top-k 2-hop, Q3 pattern matching, Q4 group by & count, Q5 update 2.1M, Q6 delete ~90%; only surviving cell value: > 1 week]
125 Scalability
126 Scalability
127 SPB Benchmark - Sparksee on RDF Additional material
128 Schema
129 Choke points 1. CP1 Join ordering 2. CP2 Aggregation 3. CP3 Optional and nested optional clauses 4. CP4 Reasoning 5. CP5 Parallel execution of unions 6. CP6 Optional with filters 7. CP7 Ordering 8. CP8 Geo-spatial predicates 9. CP9 Full text search 10. CP10 Duplicate elimination 11. CP11 Complex filter conditions
130 Queries (I)
131 Queries (II)
132 Characteristics of SPB queries
133 Results
