A Brief Study of Open Source Graph Databases
Rob McColl, David Ediger, Jason Poovey, Dan Campbell, David Bader
Georgia Tech Research Institute, Georgia Institute of Technology

Abstract

With the proliferation of large, irregular, sparse relational datasets, new storage and analysis platforms have arisen to fill gaps in performance and capability left by conventional approaches built on traditional database technologies and query languages. Many of these platforms apply graph structures and analysis techniques to enable users to ingest, update, query, and compute on the topological structure of these relationships, represented as set(s) of edges between set(s) of vertices. To store and process Facebook-scale datasets, they must be able to support data sources with billions of edges, update rates of millions of updates per second, and complex analysis kernels. These platforms must provide intuitive interfaces that enable graph experts and novice programmers to write implementations of common graph algorithms. In this paper, we explore a variety of graph analysis and storage platforms. We compare their capabilities, interfaces, and performance by implementing and computing a set of real-world graph algorithms on synthetic graphs with up to 256 million edges. In the spirit of full disclosure, several authors are affiliated with the development of STINGER.

1 Background

In the context of this paper, the term graph database is used to refer to any storage system that can contain, represent, and query a graph consisting of a set of vertices and a set of edges relating pairs of vertices. This broad definition encompasses many technologies.
Specifically, we examine schemas within traditional disk-backed, ACID-compliant SQL databases (SQLite, MySQL, Oracle, and MS SQL Server), modern NoSQL databases and graph databases (Neo4j, OrientDB, InfoGrid, Titan, FlockDB, ArangoDB, InfiniteGraph, AllegroGraph, DEX, GraphBase, and HyperGraphDB), distributed graph processing toolkits based on MapReduce, HDFS, and custom BSP engines (Bagel, Hama, Giraph, PEGASUS, Faunus), and in-memory graph packages designed for massive shared-memory systems (NetworkX, Gephi, MTGL, Boost, urika, and STINGER). For all of these, we build a matrix containing the package maintainer, license, platform, implementation language(s), features, cost, transactional capabilities, memory vs. disk storage, single-node vs. distributed, text-based query language support, built-in algorithm support, and primary traversal and query styles supported. An abridged version of this matrix is presented in Appendix A.

For a selected group of primarily open source graph databases, we implemented a series of representative common graph kernels on each technology and tested them on the same set of synthetic graphs modeled on social networks. Previous work has compared Neo4j and MySQL using simple queries and breadth-first search [7], and we are inspired by that report's combination of objective and subjective evaluation. Graph primitives for RDF query languages were extensively studied in [1] and data models for graph databases in [2]; both are beyond the scope of this study.

Algorithms and Approach

Four common graph kernels were selected, with requirements placed on how each must be implemented to emphasize common implementation styles of graph algorithms. The first is the Single Source Shortest Paths (SSSP) problem, which was specifically required to be implemented as a level-synchronous parallel breadth-first traversal of the graph.
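As a minimal illustration of the level-synchronous style required for SSSP, the sketch below computes hop distances on an unweighted graph one frontier at a time. It is a serial Python sketch, not code from the test suite; the function name `bfs_levels`, the adjacency-dictionary input, and the example graph are assumptions made for illustration. In a parallel implementation, the loop over the current frontier would be the parallel region.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency dict {vertex: [neighbors]}.

    Returns a dict mapping each reachable vertex to its hop distance
    from `source`. Unreachable vertices are absent from the result.
    """
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # All vertices in `frontier` are at distance level - 1; a parallel
        # implementation would process this loop concurrently.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = level
                    next_frontier.append(v)
        # The swap of frontiers is the synchronization point between levels.
        frontier = next_frontier
    return dist


# Hypothetical undirected example graph; vertex 5 is isolated.
example_adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
distances = bfs_levels(example_adj, 0)
```

Because every edge has unit weight in this formulation, the level at which a vertex is first discovered is its shortest-path distance from the source.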
The second is an implementation of the Shiloach-Vishkin connected components algorithm (SV) [6] in the edge-parallel label-pushing style. The third is PageRank (PR) [4] in the vertex-parallel Bulk Synchronous Parallel (BSP) power-iteration style. The last is a series of edge insertions and deletions in parallel to represent random access and modification of the structure. When it was not possible to meet these requirements due to restrictions of the framework, the algorithms were implemented as closely to them as possible. When the framework included an implementation of any of these algorithms, that implementation was used if it met the requirements. The implementations were intentionally written in a straightforward manner with no manual tuning or optimization to emulate non-hero programmers.

Many real-world networks demonstrate scale-free characteristics [3, 8]. Four initial sparse static graphs
and sets of edge insertions and deletions were created using an R-MAT [5] generator with parameters A = 0.55, B = 0.1, C = 0.1, and D = 0.25. These graphs contain 1K (tiny), 32K (small), 1M (medium), and 16M (large) vertices with 8K, 256K, 8M, and 256M undirected edges, respectively. Graphs over 1B edges were considered, but it was determined that many packages could not handle that scale. Edge updates have a 6.25% probability of being deletions uniformly selected from existing edges, where deletions are handled in the default manner for the package. For the tiny and small graphs, K updates were performed, and M updates were performed otherwise.

The quantitative measurements taken were initial graph construction time, total memory in use by the program upon completion, update rate in edges per second, and time to completion for SSSP, SV, and PR. Tests for single-node platforms and databases were run on a system with four AMD Opteron 6282 SE processors and 256GB of DDR3 RAM. Distributed system tests were performed on a cluster of 2 systems with a minimum configuration of two Intel Xeon X5660 processors with 24GB of DDR3 RAM, connected by QDR InfiniBand. HDFS was run in-memory. Results are presented in full in Appendix B. Packages for which no result is provided either ran out of memory or simply did not complete in a comparable time. The code used for testing is available at github.com/robmccoll/graphdb-testing.

2 Experiences

2.1 User Experience

A common goal of open source software is widespread adoption. Adoption promotes community involvement, and an active community is likely to contribute use cases, patches, and new features. When a user downloads a package, the goal should be to build as quickly as possible with minimal input from the user. A number of packages tested during the course of this study did not build on the first try. Often, this was the result of missing libraries or test modules that did not pass, and was resolved with help from the forums.
Build problems may reduce the likelihood that users continue with a project.

Among both commercial and open source software packages, we find a lack of consistency in approaches to documentation. It is often difficult to get started quickly. System requirements and library dependencies should be clear. Quickstart guides are appreciated. We find that most packages lack adequate usage examples. Common examples illustrate how to create a graph and query the existence of vertices and edges. Examples are needed that show how to compute graph algorithms, such as PageRank or connected components, using the provided primitives. The ideal example demonstrates data ingest, a real analysis workflow that reveals knowledge, and the interface that an application would use to conduct this analysis.

Although the problem is not unique to graphs, the software packages in this study support a multitude of input file formats. Formats can be built on XML, JSON, CSV, or other proprietary binary and plaintext formats. Each has a trade-off between size, descriptiveness, and flexibility. The number of different formats creates a challenge for data interchange. Formats that are edge-based, delimited, and self-describing can easily be parsed in parallel and translated between each other.

2.2 Developer Experience

Object-Oriented Programming (OOP) was introduced to enforce modularity, increase reusability, and add abstraction to enable generality. These are important goals; however, OOP can also inadvertently obfuscate logic and create bloated code. For example, considering a basic PageRank computation in Giraph, the function implementing the algorithm uses approximately 6% of the 275 lines of code. The remainder of the code registers various callbacks and sets up the working environment. Although the extra code provides flexibility, it can be argued that much is boilerplate.
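For a sense of how small the algorithmic core can be, the following is a minimal power-iteration PageRank written in plain Python. It is not Giraph code and not the implementation used in the tests; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from vertices without out-edges are all assumptions made for this sketch. Each pass of the outer loop plays the role of one BSP superstep: every vertex sends its rank share to its neighbors, then every vertex updates from what it received.

```python
def pagerank(adj, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict {vertex: [neighbors]}.

    Assumes every neighbor also appears as a key in `adj`.
    Returns a dict mapping each vertex to its rank; ranks sum to 1.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        # "Send" phase: each vertex splits its rank among its neighbors.
        incoming = {v: 0.0 for v in adj}
        dangling = 0.0
        for u, neighbors in adj.items():
            if neighbors:
                share = rank[u] / len(neighbors)
                for v in neighbors:
                    incoming[v] += share
            else:
                # Assumption: rank of vertices with no out-edges is
                # spread uniformly over all vertices.
                dangling += rank[u]
        # "Update" phase: combine the teleport term with received rank.
        rank = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in adj
        }
    return rank


# Hypothetical example: a star graph whose center is vertex 0.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_ranks = pagerank(star)
```

In this sketch the algorithm itself occupies roughly twenty lines, which is the kind of ratio of algorithm to scaffolding that the discussion above argues a framework should aim for.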
The framework should support the programmer spending as little time as possible writing boilerplate and configuration so that the majority of code is a clear and concise implementation of the algorithm.

A number of graph databases retain many of the properties of a relational database, representing the graph edges as a key-value store without a schema. A variety of query languages are employed. Like their relational cousins, these databases are ACID-compliant and disk-backed. Some ship with graph algorithms implemented natively. While these query languages may be comfortable to users coming from a database
background, they are rarely sufficient to implement even the most basic of graph algorithms succinctly. The better query languages are closer to full scripting environments, which may indicate that query languages are not sufficient for graph analysis.

In-memory graph databases focus more on the algorithms and rarely provide a query language. Access to the graph is done through algorithms or graph-centric APIs. This affords the user the ability to write nearly any algorithm, but presents a certain complexity and learning curve. The visitor pattern in MTGL is a strong example of a graph-centric API. The BFS visitor allows the user to register callbacks for newly visited vertices, newly discovered BFS tree edges, and edges to previously discovered vertices. Given that breadth-first traversals are frequently used as building blocks in common graph algorithms, providing an API that performs this traversal over the graph in an optimal way can abstract the specifics of the data structure while giving a performance benefit to novice users. Similar primitives might include component labelings, matchings, independent sets, and others (many of which are provided by NetworkX). Additionally, listening mechanisms for dynamic algorithms should be considered in the future.

2.3 Graph-specific Concerns

An important measurement of performance is the size of memory consumed by the application processing a graph of interest. An appropriate measurement of efficiency is the number of bytes of memory required to store each edge of the graph. The metadata associated with vertices and edges can vary by software package. The largest test graph contained more than 128 million edges, or about one gigabyte in 4-byte tuple form. Graph databases were run on machines with 256GB of memory. However, a number of software packages could not run on the largest test graph.
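The bytes-per-edge measure can be made concrete with a short calculation. The sketch below is illustrative only: it interprets "4-byte tuple form" as two 4-byte vertex identifiers per edge and takes the edge count as 128 x 2^20, both assumptions chosen because together they reproduce the "about one gigabyte" figure. The function names are hypothetical.

```python
def edge_list_bytes(num_edges, id_bytes=4):
    """Size of a bare edge list storing two vertex IDs per edge."""
    return num_edges * 2 * id_bytes


def bytes_per_edge(total_bytes, num_edges):
    """Efficiency metric: memory consumed per stored edge."""
    return total_bytes / num_edges


# Assumed edge count for the largest graph: 128 * 2**20 edges.
large_graph_edges = 128 * 2**20

# Two 4-byte IDs per edge gives exactly 2**30 bytes, i.e. one GiB.
raw_size = edge_list_bytes(large_graph_edges)

# Upper bound on per-edge overhead if a package were allowed to use
# the entire 256 GiB of the single-node test machine for this graph.
budget_per_edge = bytes_per_edge(256 * 2**30, large_graph_edges)
```

Under these assumptions the raw edge list needs 8 bytes per edge, while the machine offers a budget of 2048 bytes per edge, so a package that cannot hold the largest graph is spending more than two orders of magnitude over the raw representation on metadata and working state.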
It is often possible to store large graphs on disk, but the working set size of the algorithm in the midst of the computation can overwhelm the system. For example, a breadth-first search will often contain an iteration in which the size of the frontier is as many as half of the vertices in the graph. We find that some graph databases can store these graphs on disk, but cannot compute on them because the in-memory portion of the computation is too large.

We find differing semantics regarding fundamental operations on graphs. For example, when an inserted edge already exists, there are several possibilities: 1) do nothing, 2) insert another copy of that edge, 3) increment the edge weight of that edge, and 4) replace the existing edge. Graph databases may support one or more of these semantics. The same can be said for edge deletion and the existence of self-edges.

A common query access pattern among graph databases is the extraction of a subgraph, egonet, or neighborhood around a vertex. While this answers some questions, a more flexible interface also allows random access to the vertices and edges, as well as graph traversal. Graph traversal is a fundamental component of many graph algorithms. In some cases, random access to graph edges enables a cleaner implementation of an algorithm with better workload balance or communication properties.

3 Performance

We compute four algorithms (single source shortest path, connected components, PageRank, and update) for each of four graphs (tiny, small, medium, and large) using each graph package. Developer-provided implementations were used when available. Otherwise, implementations were created using the best algorithms, although no heroic programming or optimization was done. In general, we find up to five orders of magnitude difference in execution time. The time to compute PageRank on a graph with 256K edges is shown in Figure 1. For additional results, please refer to Appendix B.
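The four duplicate-edge semantics enumerated in Section 2.3 directly affect what an update benchmark measures, so it may help to see them side by side. The sketch below is a hypothetical in-memory store, not the behavior of any specific package in this study; the class name `EdgeStore` and the policy names are assumptions made for illustration.

```python
from collections import defaultdict


class EdgeStore:
    """Toy directed edge store with a configurable duplicate-insert policy."""

    POLICIES = ("ignore", "duplicate", "increment", "replace")

    def __init__(self, policy):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        # Maps (source, destination) to the weights of the stored copies.
        self.edges = defaultdict(list)

    def insert(self, u, v, weight=1):
        copies = self.edges[(u, v)]
        if not copies:
            copies.append(weight)      # first insertion always stores the edge
        elif self.policy == "ignore":
            pass                       # 1) do nothing
        elif self.policy == "duplicate":
            copies.append(weight)      # 2) insert another copy (multigraph)
        elif self.policy == "increment":
            copies[0] += weight        # 3) increment the existing weight
        else:
            copies[0] = weight         # 4) replace the existing edge


# Insert the same edge twice under each policy and record what is stored.
results = {}
for policy in EdgeStore.POLICIES:
    store = EdgeStore(policy)
    store.insert(0, 1, weight=1)
    store.insert(0, 1, weight=5)
    results[policy] = store.edges[(0, 1)]
```

After the two insertions, the stored state differs under every policy, which is why update rates are only comparable between packages when the semantics in use are stated.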
4 Position

We believe there is great interest in graph-based approaches for big data, and many open source efforts are under way. The goal of each of these efforts is to extract knowledge from the data in a more flexible manner than a relational database. Many approaches to graph databases build upon and leverage the long history of relational database research.
Figure 1: The time taken to compute PageRank on the small graph (32K vertices and 256K undirected edges) for each graph analysis package. [Plot: Page Rank - Small (V: 32K, E: 256K). Packages in plotted order: stinger, boost, sqlite, mtgl, giraph, networkx, orientdb, dex, bagel, titan, neo4j, pegasus.]

Our position is that graph databases must become more graph-aware in data structure and query language. The ideal graph database must understand analytic queries that go beyond neighborhood selection. In relational databases, the index represents pre-determined knowledge of the structure of the computation without knowing the specific input parameters. A relational query selects a subset of the data and joins it by known fields. The index accelerates the query by effectively pre-computing a portion of the query. In a graph database, the equivalent index is the portion of an analytic or graph algorithm that can be pre-computed or kept updated regardless of the input parameters of the query. Examples include the connected components of the graph, algorithms based on spanning trees, and vertex statistics like PageRank or clustering coefficients. A sample query might ask for shortest paths between vertices, or alternatively the top k vertices in the graph according to PageRank. Rather than computing the answer on demand, maintaining the analytic online results in lower-latency responses. While indices over the properties of vertices may be convenient for certain cases, these types of queries are served equally well by traditional SQL databases.

While many software applications are backed by databases, most end users are unaware of the SQL interface between application and data. SQL is not itself an application. Likewise, we expect that NoSQL is not itself an application. Software will be built atop NoSQL interfaces that will be hidden from the user. The successful NoSQL framework will be the one that enables algorithm and application developers to implement their ideas easily while maintaining high levels of performance.
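To illustrate the earlier point about a graph "index" being an analytic that is kept up to date rather than computed on demand, the sketch below maintains connected components incrementally as edges arrive, so that a connectivity query is answered without traversing the graph. It uses a union-find structure, which is a substitute chosen for brevity and not the Shiloach-Vishkin algorithm used in the tests; it handles insertions only, since supporting deletions requires additional machinery. The class name `ComponentIndex` and its methods are hypothetical.

```python
class ComponentIndex:
    """Connected components maintained online under edge insertions."""

    def __init__(self):
        self.parent = {}   # union-find forest: vertex -> parent vertex
        self.size = {}     # component size, valid at root vertices
        self.count = 0     # current number of components

    def add_vertex(self, v):
        if v not in self.parent:
            self.parent[v] = v
            self.size[v] = 1
            self.count += 1

    def _find(self, v):
        # Locate the root, then compress the path behind us.
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def insert_edge(self, u, v):
        """Update the index for a newly inserted undirected edge."""
        self.add_vertex(u)
        self.add_vertex(v)
        ru, rv = self._find(u), self._find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru           # attach the smaller tree to the larger
        self.size[ru] += self.size[rv]
        self.count -= 1

    def same_component(self, u, v):
        """Query answered from the index, with no graph traversal."""
        if u not in self.parent or v not in self.parent:
            return False
        return self._find(u) == self._find(v)


index = ComponentIndex()
for edge in [(0, 1), (2, 3), (4, 5)]:
    index.insert_edge(*edge)
before = index.same_component(0, 3)
index.insert_edge(1, 2)
after = index.same_component(0, 3)
```

The cost of each insertion is nearly constant, and queries against the index do not grow with the size of the graph, which is the latency advantage the position above attributes to maintaining analytics online.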
Visualization frameworks are often incorporated into graph databases. The result of a query is returned as a visualization of the extracted subgraph. State-of-the-art network visualization techniques rarely scale to more than one thousand vertices. We believe that relying on visualization for query output limits the types of queries that can be asked. Queries based on temporal and semantic changes can be difficult to capture in a force-based layout. Alternative strategies include visualizing statistics of the output of the algorithms over time, rather than the topological structure of the graph.

With regard to the performance of disk-backed databases, transactional guarantees may unnecessarily reduce performance. It could be argued that such guarantees are not necessary in the average case, especially atomicity and durability. For example, if the goal is to capture and analyze a stream of noisy data from Twitter, it may be acceptable for an edge connecting two users to briefly appear in one direction and not the other. Similarly, in the event of a power failure, the loss of a few seconds of data may not significantly change which vertices have the highest centrality on the graph. Some of the databases presented here seem to have reached this realization and have made transactional guarantees optional.
References

[1] Renzo Angles and Claudio Gutierrez. Querying RDF data from a graph database perspective. In The Semantic Web: Research and Applications, volume 3532 of Lecture Notes in Computer Science, pages 346-360. Springer Berlin Heidelberg, 2005.
[2] Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1-1:39, February 2008.
[3] Albert-László Barabási and Réka Albert. Emergence of Scaling in Random Networks. Science, 286(5439):509-512, October 1999.
[4] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998. Proceedings of the Seventh International World Wide Web Conference.
[5] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proc. 4th SIAM Intl. Conf. on Data Mining (SDM), Orlando, FL, April 2004. SIAM.
[6] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algs., 3(1):57-67, 1982.
[7] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn Wilkins. A comparison of a graph database and a relational database: a data provenance perspective. In Proceedings of the 48th Annual Southeast Regional Conference, pages 42:1-42:6, 2010.
[8] D.J. Watts and S.H. Strogatz. Collective dynamics of small world networks. Nature, 393:440-442, 1998.
A Graph Databases

Package | Owner / Maintainer | License | Platform | Language | Distribution | Cost | Transactional | Memory-Based | Disk-Based | Single-Node | Distributed | Graph Algorithms | Text-Based Query Language | Embeddable | Software | Data Store | Type
MySQL | Oracle | GPL/Proprietary | x86 | C/C++ | Bin | Free | X | - | X | X | - | - | SQL | X | X | X | SQLDB
Oracle | Oracle | Proprietary | x86 | C/C++ | Bin | $80-$950 | X | - | X | X | X | - | SQL | X | X | X | SQLDB
SQL Server | Microsoft | Proprietary | x86-Win | C++ | Bin | $898-$8592 | X | - | X | X | - | - | SQL | - | X | X | SQLDB
SQLite | Richard Hipp | Public Domain | x86 | C | Src/Bin | Free | X | X | X | X | - | - | SQL | X | X | X | SQLDB
AllegroGraph | Franz, Inc. | Proprietary | x86 | Likely Java | Bin | Free-ish/$$$$ | X | - | X | X | - | X | SPARQL,RDFS++,Prolog | - | X | X | GDB
ArangoDB | ArangoDB | Apache | x86 | C/C++/JS | Src/Bin | Free | - | - | X | X | - | - | AQL | - | X | X | GDB/KV/DOC
DEX | Sparsity-Technologies | Proprietary | x86 | C++ | Bin | Free Personal/Commercial $$ | X | - | X | X | - | X | Traversal | X | - | X | GDB
FlockDB | Twitter | Apache | x86 | Java,Scala,Ruby | Src | Free | (remaining cells incomplete in source: - - X X X X X) | GDB
GraphBase | FactNexus | Proprietary | x86 | Java | Bin | Free,$5/mo,$20,000 | ? | - | X | X | - | - | Bounds | X | X | X | GDB
HyperGraphDB | Kobrix Software | LGPL | x86 | Java | Src | Free | MVCC | X | X | X | X | - | HGQuery,Traversal | X | - | - | HyperGDB
InfiniteGraph | Objectivity | Proprietary | x86 | Java/C++ | Bin | Free Trial/$5,000 | Both | - | X | X | X | - | Gremlin | X | X | X | GDB
InfoGrid | Johannes Ernst | AGPL/Proprietary | x86 | Java | Src/Bin | Free + Support | - | X | X | X | X | - | - | X | X | - | GDB
Neo4j | Neo Technology | GPL/Proprietary | x86 | Java | Src/Bin | Free,$6,000-$24,000 | X | - | X | X | - | X | Cypher | X | X | X | GDB/NoSQL
OrientDB | NuvolaBase Ltd | Apache | x86 | Java | Src/Bin | Free + Support | Both | X | X | X | X | - | Extended SQL, Gremlin | X | X | X | GDB/NoSQL
Titan | Aurelius | Apache | x86 | Java | Src/Bin | Free + Support | Both | - | X | X | X | - | Gremlin | X | X | - | GDB
Bagel | UC Berkeley | BSD | x86 | Java/Scala/Spark | Src | Free | - | X | - | X | X | X | - | X | - | - | BSP
BGL | Boost / IU | Boost | x86 / C++ | C++ | Src/Bin | Free | - | X | - | X | - | X | - | X | - | - | Library
Faunus | Aurelius | Apache | x86 | Java | Src | Free + Support | Both | - | X | X | X | - | Gremlin | X | X | - | Hadoop
Gephi | Gephi Consortium | GPL/CDDL | x86 | Java,OpenGL | Src/Bin | Free | - | X | - | X | - | X | - | X | X | - | Toolkit
Giraph | Apache | Apache | x86 | Java | Src | Free | - | X | - | X | X | X | - | X | - | - | BSP
GraphStream | University Le Havre | LGPL/CeCILL-C | x86 | Java | Src/Bin | Free | - | X | - | X | - | X | - | X | - | - | Library
Hama | Apache | Apache | x86 | Java | Src | Free | - | X | - | X | X | X | - | X | - | - | BSP
MTGL | Sandia NL | BSD | XMT | C++ | Src | Free | - | X | - | X | - | X | - | X | - | - | Library
NetworkX | Los Alamos NL | BSD | x86 | Python | Src/Bin | Free | - | X | - | X | - | X | - | X | - | - | Library
PEGASUS | CMU | Apache | x86 | Java | Src/Bin | Free | - | - | X | X | X | X | - | - | X | - | Hadoop
STINGER | GT / GTRI | BSD | x86*/xmt | C | Src | Free | - | X | - | X | - | X | - | X | X | X | Library
urika | Yarc Data / Cray | Proprietary | XMT | Likely C++ | Bin | $$$$ | ? | X | - | X | - | - | SPARQL | - | X | X | Appliance
B Performance Results

B.1 Tiny Graph

Figure 2: The time taken to construct the tiny graph (1K vertices and 8K undirected edges) for each graph analysis package. [Plot: Initial Graph Construction - Tiny (V: 1K, E: 8K). Packages in plotted order: pegasus-NA, boost, mtgl, stinger, networkx, dex, bagel, sqlite, giraph, titan, neo4j, orientdb.]

Figure 3: The time taken to compute SSSP from vertex 0 in the tiny graph (1K vertices and 8K undirected edges) for each graph analysis package. [Plot: Single Source Shortest Path - Tiny (V: 1K, E: 8K). Packages in plotted order: stinger, boost, mtgl, networkx, dex, neo4j, titan, orientdb, sqlite, giraph, bagel, pegasus.]

Figure 4: The time taken to label the connected components of the tiny graph (1K vertices and 8K undirected edges) for each graph analysis package. [Plot: Connected Components - Tiny (V: 1K, E: 8K). Packages in plotted order: stinger, boost, networkx, mtgl, dex, orientdb, titan, giraph, neo4j, bagel, sqlite, pegasus.]

Figure 5: The time taken to compute the PageRank of each vertex in the tiny graph (1K vertices and 8K undirected edges) for each graph analysis package. [Plot: Page Rank - Tiny (V: 1K, E: 8K). Packages in plotted order: stinger, boost, mtgl, networkx, orientdb, dex, sqlite, titan, giraph, bagel, neo4j, pegasus.]

Figure 6: The update rate in edges inserted/deleted per second when applying,000 edge updates to the tiny graph (1K vertices and 8K undirected edges) for each graph analysis package. [Plot: Update Rate - Tiny (V: 1K, E: 8K). Axis: Edges per Second. Packages in plotted order: orientdb, titan, neo4j, pegasus, sqlite, dex, bagel, giraph, networkx, boost, stinger, mtgl-insertonly.]

Figure 7: The memory occupied by the program after performing all operations on the tiny graph (1K vertices and 8K undirected edges) for each graph analysis package. [Plot: Memory Usage - Tiny (V: 1K, E: 8K). Axis: Memory Usage (KB). Packages in plotted order: giraph-NA, pegasus-NA, dex, sqlite, stinger, mtgl, boost, networkx, titan, bagel, orientdb, neo4j.]

B.2 Small Graph

Figure 8: The time taken to construct the small graph (32K vertices and 256K undirected edges) for each graph analysis package. [Plot: Initial Graph Construction - Small (V: 32K, E: 256K). Packages in plotted order: pegasus-NA, stinger, boost, mtgl, sqlite, bagel, giraph, networkx, titan, dex, neo4j, orientdb.]

Figure 9: The time taken to compute SSSP from vertex 0 in the small graph (32K vertices and 256K undirected edges) for each graph analysis package. [Plot: Single Source Shortest Path - Small (V: 32K, E: 256K). Packages in plotted order: boost, stinger, mtgl, networkx, sqlite, titan, neo4j, dex, orientdb, giraph, bagel, pegasus.]

Figure 10: The time taken to label the connected components of the small graph (32K vertices and 256K undirected edges) for each graph analysis package. [Plot: Connected Components - Small (V: 32K, E: 256K). Packages in plotted order: stinger, mtgl, boost, networkx, giraph, neo4j, dex, titan, orientdb, bagel, sqlite, pegasus.]

Figure 11: The time taken to compute the PageRank of each vertex in the small graph (32K vertices and 256K undirected edges) for each graph analysis package. [Plot: Page Rank - Small (V: 32K, E: 256K). Packages in plotted order: stinger, boost, sqlite, mtgl, giraph, networkx, orientdb, dex, bagel, titan, neo4j, pegasus.]

Figure 12: The update rate in edges inserted/deleted per second when applying,000 edge updates to the small graph (32K vertices and 256K undirected edges) for each graph analysis package. [Plot: Update Rate - Small (V: 32K, E: 256K). Axis: Edges per Second. Packages in plotted order: titan, orientdb, neo4j, pegasus, sqlite, dex, bagel, giraph, networkx, mtgl-insertonly, boost, stinger.]

Figure 13: The memory occupied by the program after performing all operations on the small graph (32K vertices and 256K undirected edges) for each graph analysis package. [Plot: Memory Usage - Small (V: 32K, E: 256K). Axis: Memory Usage (KB). Packages in plotted order: giraph-NA, pegasus-NA, sqlite, dex, stinger, mtgl, boost, bagel, titan, networkx, neo4j, orientdb.]

B.3 Medium Graph

Figure 14: The time taken to construct the medium graph (1M vertices and 8M undirected edges) for each graph analysis package. [Plot: Initial Graph Construction - Medium (V: 1M, E: 8M). Packages in plotted order: pegasus-NA, stinger, bagel, giraph, boost, mtgl, networkx, titan, orientdb.]

Figure 15: The time taken to compute SSSP from vertex 0 in the medium graph (1M vertices and 8M undirected edges) for each graph analysis package. [Plot: Single Source Shortest Path - Medium (V: 1M, E: 8M). Packages in plotted order: stinger, boost, mtgl, giraph, networkx, titan, bagel, orientdb, pegasus.]

Figure 16: The time taken to label the connected components of the medium graph (1M vertices and 8M undirected edges) for each graph analysis package. [Plot: Connected Components - Medium (V: 1M, E: 8M). Packages in plotted order: stinger, mtgl, boost, giraph, networkx, titan, bagel, orientdb, pegasus.]

Figure 17: The time taken to compute the PageRank of each vertex in the medium graph (1M vertices and 8M undirected edges) for each graph analysis package. [Plot: Page Rank - Medium (V: 1M, E: 8M). Packages in plotted order: stinger, boost, giraph, mtgl, networkx, orientdb, pegasus, bagel, titan.]

Figure 18: The update rate in edges inserted/deleted per second when applying,000 edge updates to the medium graph (1M vertices and 8M undirected edges) for each graph analysis package. [Plot: Update Rate - Medium (V: 1M, E: 8M). Axis: Edges per Second. Packages in plotted order: orientdb, titan, bagel, pegasus, networkx, boost, giraph, mtgl-insertonly, stinger.]

Figure 19: The memory occupied by the program after performing all operations on the medium graph (1M vertices and 8M undirected edges) for each graph analysis package. [Plot: Memory Usage - Medium (V: 1M, E: 8M). Axis: Memory Usage (KB). Packages in plotted order: giraph-NA, pegasus-NA, stinger, boost, bagel, mtgl, titan, networkx, orientdb.]

B.4 Large Graph

Figure 20: The time taken to construct the large graph (16M vertices and 128M undirected edges) for each graph analysis package. [Plot: Initial Graph Construction - Large (V: 16M, E: 128M). Packages in plotted order: pegasus-NA, stinger, giraph, boost, mtgl.]

Figure 21: The time taken to compute SSSP from vertex 0 in the large graph (16M vertices and 128M undirected edges) for each graph analysis package. [Plot: Single Source Shortest Path - Large (V: 16M, E: 128M). Packages in plotted order: stinger, boost, giraph, mtgl, pegasus.]

Figure 22: The time taken to label the connected components of the large graph (16M vertices and 128M undirected edges) for each graph analysis package. [Plot: Connected Components - Large (V: 16M, E: 128M). Packages in plotted order: stinger, mtgl, giraph, boost, pegasus.]

Figure 23: The time taken to compute the PageRank of each vertex in the large graph (16M vertices and 128M undirected edges) for each graph analysis package. [Plot: Page Rank - Large (V: 16M, E: 128M). Packages in plotted order: stinger, boost, giraph, pegasus, mtgl.]

Figure 24: The update rate in edges inserted/deleted per second when applying,000 edge updates to the large graph (16M vertices and 128M undirected edges) for each graph analysis package. [Plot: Update Rate - Large (V: 16M, E: 128M). Axis: Edges per Second. Packages in plotted order: pegasus, mtgl-insertonly, giraph, boost, stinger.]

Figure 25: The memory occupied by the program after performing all operations on the large graph (16M vertices and 128M undirected edges) for each graph analysis package. [Plot: Memory Usage - Large (V: 16M, E: 128M). Axis: Memory Usage (KB). Packages in plotted order: giraph-NA, pegasus-NA, stinger, boost, mtgl.]
A Performance Evaluation of Open Source Graph Databases
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Georgia Institute of Technology Abstract With the proliferation of large, irregular,
A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
Review of Graph Databases for Big Data Dynamic Entity Scoring
Review of Graph Databases for Big Data Dynamic Entity Scoring M. X. Labute, M. J. Dombroski May 16, 2014 Disclaimer This document was prepared as an account of work sponsored by an agency of the United
STINGER: High Performance Data Structure for Streaming Graphs
STINGER: High Performance Data Structure for Streaming Graphs David Ediger Rob McColl Jason Riedy David A. Bader Georgia Institute of Technology Atlanta, GA, USA Abstract The current research focus on
Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012
Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships
Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
GRAPH DATABASE SYSTEMS. h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe 2016 1
GRAPH DATABASE SYSTEMS h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe 2016 1 Use Case: Route Finding Source: Neo Technology, Inc. h_da Prof. Dr. Uta Störl Big Data Technologies:
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015
E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing
Objectivity positions graph database as relational complement to InfiniteGraph 3.0
Objectivity positions graph database as relational complement to InfiniteGraph 3.0 Analyst: Matt Aslett 1 Oct, 2012 Objectivity Inc has launched version 3.0 of its InfiniteGraph graph database, improving
! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)
! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and
Review: Graph Databases
Volume 4, Issue 5, May 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Pallavi Madan M.Tech,Dept
An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013
U.S. National Security Agency Research Directorate - R6 Technical Report NSA-RD-2013-056002v1 May 20, 2013 Graphs are everywhere! A graph is a collection of binary relationships, i.e. networks of pairwise
Cray: Enabling Real-Time Discovery in Big Data
Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects
Graph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
Evaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist [email protected], [email protected], [email protected] Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
The Current State of Graph Databases
The Current State of Graph Databases Mike Buerli Department of Computer Science Cal Poly San Luis Obispo [email protected] December 2012 Abstract Graph Database Models is increasingly a topic of interest
Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage
Big Graph Analytics on Neo4j with Apache Spark Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage My background I only make it to the Open Stages :) Probably because Apache Neo4j
A Comparison of Current Graph Database Models
A Comparison of Current Graph Database Models Renzo Angles Universidad de Talca (Chile) 3rd Int. Workshop on Graph Data Management: Techniques and applications (GDM 2012) 5 April, Washington DC, USA Outline
1-Oct 2015, Bilbao, Spain. Towards Semantic Network Models via Graph Databases for SDN Applications
1-Oct 2015, Bilbao, Spain Towards Semantic Network Models via Graph Databases for SDN Applications Agenda Introduction Goals Related Work Proposal Experimental Evaluation and Results Conclusions and Future
A Practical Approach to Process Streaming Data using Graph Database
A Practical Approach to Process Streaming Data using Graph Database Mukul Sharma Research Scholar Department of Computer Science & Engineering SBCET, Jaipur, Rajasthan, India ABSTRACT In today's information
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 The big data revolution is upon us, fed
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Investigating Hadoop for Large Spatiotemporal Processing Tasks
Investigating Hadoop for Large Spatiotemporal Processing Tasks David Strohschein [email protected] Stephen Mcdonald [email protected] Benjamin Lewis [email protected] Weihe
InfiniteGraph: The Distributed Graph Database
A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
An empirical comparison of graph databases
An empirical comparison of graph databases Salim Jouili Eura Nova R&D 1435 Mont-Saint-Guibert, Belgium Email: [email protected] Valentin Vansteenberghe Universite Catholique de Louvain 1348 Louvain-La-Neuve,
Lecture 10: HBase. Claudia Hauff (Web Information Systems). [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase Claudia Hauff (Web Information Systems) [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
Oracle Big Data Spatial and Graph
Oracle Big Data Spatial and Graph Oracle Big Data Spatial and Graph offers a set of analytic services and data models that support Big Data workloads on Apache Hadoop and NoSQL database technologies. For
Big Data Graphs and Apache TinkerPop 3. David Robinson, Software Engineer April 14, 2015
Big Data Graphs and Apache TinkerPop 3 David Robinson, Software Engineer April 14, 2015 How This Talk Is Structured Part 1 Graph Background and Using Graphs Part 2 Apache TinkerPop 3 Part 1 builds a foundation
Microblogging Queries on Graph Databases: An Introspection
Microblogging Queries on Graph Databases: An Introspection ABSTRACT Oshini Goonetilleke RMIT University, Australia [email protected] Timos Sellis RMIT University, Australia [email protected]
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks
WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance
Apache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
Best Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Cloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
How graph databases started the multi-model revolution
How graph databases started the multi-model revolution Luca Garulli Author and CEO @OrientDB QCon Sao Paulo - March 26, 2015 Welcome to Big Data 90% of the data in the world today has been created in the
Oracle's Big Data solutions. Roger Wullschleger
Oracle's Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Using an In-Memory Data Grid for Near Real-Time Data Analysis
SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 In today's competitive world, businesses
Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc
Big Data, Fast Data, Complex Data Jans Aasman Franz Inc Private, founded 1984 AI, Semantic Technology, professional services Now in Oakland Franz Inc Who We Are (1 (2 3) (4 5) (6 7) (8 9) (10 11) (12
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
NoSQL and Graph Database
NoSQL and Graph Database Biswanath Dutta DRTC, Indian Statistical Institute 8th Mile Mysore Road R. V. College Post Bangalore 560059 International Conference on Big Data, Bangalore, 9-20 March 2015 Outlines
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 Twenty-first
MEAP Edition Manning Early Access Program Neo4j in Action MEAP version 3
MEAP Edition Manning Early Access Program Neo4j in Action MEAP version 3 Copyright 2012 Manning Publications For more information on this and other Manning titles go to www.manning.com brief contents PART
Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies
Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08
Domain driven design, NoSQL and multi-model databases
Domain driven design, NoSQL and multi-model databases Java Meetup New York, 10 November 2014 Max Neunhöffer www.arangodb.com Max Neunhöffer I am a mathematician Earlier life : Research in Computer Algebra
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 Hadoop has been widely embraced for
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World
COSC 304 Introduction to Database Systems Database Introduction Dr. Ramon Lawrence University of British Columbia Okanagan [email protected] What is a database? A database is a collection of logically related data for
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures In-memory Semantic Database Formulation
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes
Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate
HIGH PERFORMANCE BIG DATA ANALYTICS
HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning
Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
Large-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki [email protected] http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
Big Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
NoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
Graph Database Proof of Concept Report
Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Large-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.
Data-intensive HPC: opportunities and challenges. Patrick Valduriez
Data-intensive HPC: opportunities and challenges Patrick Valduriez Big Data Landscape Multi-$billion market! Big data = Hadoop = MapReduce? No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard,
Mining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d'Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
Oracle Big Data Building A Big Data Management System
Oracle Big Data Building A Big Data Management System Copyright 2015, Oracle and/or its affiliates. All rights reserved. Effi Psychogiou ECEMEA Big Data Product Director May, 2015 Safe Harbor Statement The following
Big Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
FAWN - a Fast Array of Wimpy Nodes
University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed
How To Improve Performance In A Database
Some issues on Conceptual Modeling and NoSQL/Big Data Tok Wang Ling National University of Singapore 1 Database Models File system - field, record, fixed length record Hierarchical Model (IMS) - fixed
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima¹ and Hiroki Matsutani¹,²,³ ¹Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan ²National Institute
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
Big Graph Data Management
Big Graph Data Management Antonio Maccioni [email protected] Big Data Course, Rome, 14 May 2015 This talk is about Graph Databases: models,
Scaling Up HBase, Hive, Pegasus
CSE 6242 A / CS 4803 DVA Mar 7, 2013 Scaling Up HBase, Hive, Pegasus Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics
FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics André Petermann 1,2, Martin Junghanns 1, Robert Müller 2 and Erhard Rahm 1 1 University of Leipzig {petermann,junghanns,rahm}@informatik.uni-leipzig.de
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
Visualizing a Neo4j Graph Database with KeyLines
Visualizing a Neo4j Graph Database with KeyLines Introduction What is a graph database? What is Neo4j? Why visualize Neo4j? Visualization Architecture Benefits of the KeyLines/Neo4j architecture
Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren
News and trends in Data Warehouse Automation, Big Data and BI Johan Hendrickx & Dirk Vermeiren Extreme Agility from Source to Analysis DWH Appliances & DWH Automation Typical Architecture 3 What Business
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Map-Based Graph Analysis on MapReduce
2013 IEEE International Conference on Big Data Map-Based Graph Analysis on MapReduce Upa Gupta, Leonidas Fegaras University of Texas at Arlington, CSE Arlington, TX 76019 {upa.gupta,fegaras}@uta.edu Abstract
Architectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet [email protected] October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
Reference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
1 File Processing Systems
COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.
