Socrates: A System for Scalable Graph Analytics
C. Savkli, R. Carr, M. Chapman, B. Chee, D. Minch
Johns Hopkins University, Applied Physics Laboratory
Mail Stop MP6-N312, Laurel, MD, USA
[email protected]

Abstract: A distributed graph processing system that provides locality control, indexing, graph query, and parallel processing capabilities is presented.

Keywords: graph, distributed, semantic, query, analytics

I. INTRODUCTION

Graphs provide a flexible data structure that facilitates fusion of disparate data sets. The popularity of graphs has grown steadily with the development of the internet, cyber, and social networks. While graphs provide a flexible data structure, processing and analysis of large graphs remains a challenging problem. Successful implementation of graph analytics revolves around several key considerations: rapid data ingest and retrieval, scalable storage, and parallel processing. In this paper we present a graph analytics platform that is particularly focused on facilitating analysis of large-scale, attribute-rich graphs.

Recently, NoSQL systems such as Hadoop have become popular for storing big data; however, these systems face several fundamental challenges that make analyzing that data difficult:
- Lack of secondary indexing, which leads to poor performance of attribute queries (e.g., "What flights have we seen moving faster than 500 mph?").
- Lack of locality control, which can lead to unnecessary movement of data.
- Lack of a well-defined schema, which makes database maintenance challenging.

More traditional relational databases (RDBMS) do not have these problems, but face their own set of challenges when dealing with big data:
- Table structures not flexible enough to support new kinds of data easily.
- Poor parallelization and scalability.

We have developed a graph processing system called Socrates that is designed to facilitate the development of scalable analytics on graph data.
Socrates combines the key features of these two approaches while avoiding the problems above. It features several advances in data management for parallel computing and scalable distributed storage:
- Property graph model: supports representation of different kinds of data using a simple, extensible database schema.
- Blueprints compatibility: implements the open-source Blueprints graph API, allowing analytics to be easily adapted to and from other Blueprints-enabled storage systems.
- Distributed storage and processing: functionality to execute algorithms on each cluster node and merge results with minimal inter-node communication.
- Indexing: every attribute is indexed for fast random access and analytics.
- Distributed management: nothing in the cluster is centrally managed, including communication and the locating of graph elements.
- Locality control: graph vertices can be placed on specific machines, a feature essential for minimizing data movement in graph analytics.
- Platform independence: runs in the Java Virtual Machine.

II. RELATED WORK

Socrates is designed for attribute-rich data having many labels on both nodes and edges. Twitter's Cassovary, FlockDB, and Pegasus are fast at processing structural graphs. However, they do not leverage or provide facilities to store or use labels on edges or nodes beyond edge weights [1][2][3]. Current graph benchmarking tools such as the HPC Scalable Graph Analysis Benchmark (HPC-SGAB) generate tuples of the form <StartVertex, EndVertex, Weight> with no other attributes, which plays to the strengths of Cassovary or Pegasus [4]. However, such benchmarks tend to ignore the functionality we specifically aim for in this work.
Fig. 1: Graph structure representation on the cluster.

The Knowledge Discovery Toolbox (KDT) is a Python toolbox built on Combinatorial BLAS, a linear-algebra-based library [5]. A linear-algebra adjacency-matrix approach is similar to Pegasus or GraphLab; however, KDT enables the use of attributes through filtering. Users create queries or filters on an input graph, generating a new adjacency matrix that is used by Combinatorial BLAS. However, this filtering process can be expensive on large graphs and lead to storage issues; KDT also requires that the adjacency matrix fit into distributed main memory [5].

The Dynamic Distributed Dimensional Data Model (D4M) is a database and analytics platform built using Matlab on top of various tuple stores such as Accumulo or HBase. The goal of D4M is to combine the advantages of distributed arrays, tuple stores, and multi-dimensional associative arrays [6]. D4M focuses on mathematical operations on associative arrays, which graphs or adjacency matrices can be represented as; Socrates, however, focuses primarily on property graphs and traditional edge and vertex operations.

Other graph databases focus more on content or tend to model specific relationship types, such as those in an ontology. These databases include Jena, OpenLink Virtuoso, R2DF, and other commercial offerings that use the Resource Description Framework (RDF), which was originally designed to represent metadata [4][7][8]. RDF expressions consist of triples (subject, predicate, and object) that are stored and queried against; the predicate in each triple represents a relationship between a subject and an object. Intuitively, a general set of RDF tuples can be considered a graph, although RDF is not formally defined as a mathematical graph [9].

Neo4J, Titan, HypergraphDB, and DEX offer capabilities similar to Socrates'. Socrates, Neo4J, and Titan all make use of the Blueprints API for interacting with graphs [10].
However, Socrates has extended this API to offer enhanced graph processing functionality that includes locality control, additional graph methods facilitating graph analytics, and a parallel processing capability.

Socrates is built upon a cluster of SQL databases, similar to Facebook's TAO and Twitter's deprecated FlockDB [11][3]. Facebook's social graph is currently stored in and served via TAO, a data model and API specifically designed for social graphs and for Facebook's workload needs: frequent reads with fewer writes, edge queries that mostly return empty results, and node connectivity and data sizes with long-tailed distributions. These workload characteristics enable efficient cache implementations, since a large portion of reads are potentially in cache (e.g., people are interested in current events, giving time locality, or a post goes viral and is viewed by many people). TAO supports relatively few query types and stores its main data objects and associations as key-value pairs. Socrates stores attributes in a similar fashion, using many tables in which each attribute is stored as the value and the ID (of a node or edge) as the key. This schema eliminates joins and the associated memory and performance bottlenecks, such as query planning or combinatorial explosions in the case of outer joins. TAO was developed with specific constraints in mind, such as global scaling and time-based social graphs, whereas Socrates targets more general graph structures.

Fig. 2: The JGraph model of parallelism in Socrates. Users submit a processing job that is run in parallel on each node in the cluster and given access to the subgraph stored on that node.

III. DESIGN

A. Data Management & Locality

Figure 1 demonstrates Socrates' storage schema for a small graph on a two-node cluster.
Structural graph information is stored in two tables on each machine: a vertex table containing only the IDs of the vertices stored on the machine, and an edge table containing, for every edge connected to a vertex in the vertex table: an edge ID; the IDs of both connected vertices; and identifiers specifying which machine each connected vertex is stored on. Thus, each vertex is stored on only one machine, while each edge is stored on either one machine (such as E_4_1 or E_7_5 in Fig. 1) or two (such as E_1_7 in Fig. 1). Edges in Socrates are directed, but their direction can be ignored by undirected graph algorithms. For each attribute present on any vertex or outgoing edge on a machine, Socrates also stores a two-column table consisting of <id, val> tuples, where id is either a vertex or edge identifier and val is the value of the attribute. This allows each attribute to be independently indexed and queried, while avoiding the complexity typically associated with table structure changes in relational databases.
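The per-attribute <id, val> layout described above can be sketched with SQLite (the table names, the speed attribute, and the sample data are illustrative assumptions; Socrates itself runs on MySQL):

```python
import sqlite3

db = sqlite3.connect(":memory:")
cur = db.cursor()

# Structural tables: vertices local to this machine, and every edge
# touching a local vertex, with the machine holding each endpoint.
cur.execute("CREATE TABLE vertices (id TEXT PRIMARY KEY)")
cur.execute("""CREATE TABLE edges (
    id TEXT PRIMARY KEY, src TEXT, dst TEXT,
    src_machine INTEGER, dst_machine INTEGER)""")

# One two-column <id, val> table per attribute, indexed on the value,
# so each attribute can be queried without touching unrelated data.
def ensure_attribute(name):
    cur.execute(f"CREATE TABLE IF NOT EXISTS attr_{name} (id TEXT PRIMARY KEY, val)")
    cur.execute(f"CREATE INDEX IF NOT EXISTS idx_{name} ON attr_{name} (val)")

ensure_attribute("speed")
cur.execute("INSERT INTO vertices VALUES ('V1'), ('V2')")
cur.execute("INSERT INTO edges VALUES ('E_1_2', 'V1', 'V2', 0, 0)")
cur.execute("INSERT INTO attr_speed VALUES ('V1', 620), ('V2', 480)")

# Attribute query: which flights move faster than 500 mph?
fast = [r[0] for r in cur.execute("SELECT id FROM attr_speed WHERE val > 500")]
print(fast)  # ['V1']
```

Adding a new attribute in this layout only creates a new two-column table, which is why attribute-schema changes stay cheap.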
[Fig. 4 query-graph labels: Person of Interest 1: brown hair; weight between 120 and 140 kg; height between 150 and 170 cm. Edges labeled Siblings and Associates connect persons of interest 1, 2, and 3.]

Fig. 3: The graph on the left represents one node of a four-machine cluster where data was archived without locality control. The vertices at the top with no edges have no neighbors on the local machine. In this case the probability that a neighbor resides on the same machine is ¼, which is reflected in the outcome: only about a quarter of the vertices have their neighbors local. The graph on the right shows what a partial graph on the same machine looks like when archived using Socrates. The number of edges that connect to other machines is minimized, which enables more efficient computation of graph algorithms.

Users can access the data in the cluster via an augmented version of the Blueprints graph API. Besides the standard Blueprints methods, Socrates' graph API provides additional methods that take advantage of its architecture for efficient implementation of analytics. One such method supports locality control, i.e., adding vertices of the graph to specific machines. This machine-level access to the graph allows a user to partition the graph to minimize edge crossings between machines. The example in Fig. 3, built using the Brightkite data set [13], illustrates the benefit of locality control. The ability to place graph vertices on particular machines can also be used to automatically partition the data when attributes can be hashed to generate a machine ID, for example partitioning the graph based on the latitude-longitude attributes of its vertices.

Socrates also provides advanced query capabilities that are implemented efficiently against this data model. For example, finding the joint neighbors of a pair of vertices is a key operation for a variety of link-discovery analyses.
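Finding joint neighbors amounts to intersecting the neighbor sets recorded in each machine's edge table and unioning the per-machine results. A minimal in-memory sketch under illustrative data (the function name and toy edge lists are assumptions; Socrates issues the per-machine step as one SQL query against the edge table):

```python
# Joint neighbors of u and v: per machine, intersect the neighbor sets
# recorded in that machine's edge table; the client then unions the
# per-machine results, with no communication between machines.
def joint_neighbors_local(edges, u, v):
    """edges: list of (src, dst) pairs from one machine's edge table."""
    def nbrs(x):
        return {d for s, d in edges if s == x} | {s for s, d in edges if d == x}
    return nbrs(u) & nbrs(v)

machine0 = [("A", "B"), ("A", "C"), ("B", "C")]
machine1 = [("A", "D"), ("B", "D")]
result = set().union(*(joint_neighbors_local(m, "A", "B")
                       for m in (machine0, machine1)))
print(sorted(result))  # ['C', 'D']
```

Because every edge incident to a local vertex is stored locally, each machine's answer is complete for its own vertices, which is what makes the communication-free union correct.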
Socrates provides a function to do this with a single database query to the edge database on each machine (without any communication between cluster nodes), rather than iterating over the neighbors of each vertex on the client. Another advanced query capability Socrates provides is a function that queries for all instances of a small sub-graph, which can include structural constraints (i.e., which vertices connect and which do not) as well as semantic constraints (i.e., constraints on vertex and edge attributes); see Fig. 4 for an example. This capability is based on an implementation of the VF2 algorithm [14], adapted to take advantage of Socrates' data model and the JGraph model of parallelism described in the next section. (The maximum size of the sub-graph query varies depending on the strength of its constraints and the complexity of the underlying data. Socrates' search algorithm aggressively prunes its search space, so generally if a query is too slow it is because the graph contains too many matches.)

Fig. 4: An example of Socrates' sub-graph query capability. The left graph is the query graph, which contains both structural and semantic constraints. The right graph is the underlying data, with the instance of the query graph highlighted. [Query-graph labels: P.O.I. 2: brown hair; weight between 90 and 110 kg; height between 190 and 210 cm. P.O.I. 3: no beard; weight between 80 and 100 kg; height between 185 and 200 cm.]

In addition to parallelized implementations of useful structural graph operations, Socrates also takes advantage of the underlying SQL servers to efficiently perform various attribute operations on each cluster node without having to move data to the client.

B. Parallel Processing

A key feature for handling large-scale graphs is the ability to process the graph in parallel without having to move data.
A primary challenge in parallel processing of graphs is that, for most nontrivial problems, the analysis on each machine of the cluster requires access to data on other machines in order to produce a result. This is an area where the ability to control partitioning of the graph over the cluster becomes critical. Another challenge in implementing parallelized graph analytics is designing an algorithm that takes advantage of distributed hardware wherever possible. Rather than limit users to a single parallel processing model, Socrates supports three models of parallel processing over the distributed graph: DGraph, JGraph, and Neighborhood. Each provides a different tradeoff between ease of algorithm implementation, parallelism of client code, and network communication, allowing users to select the model that best suits the needs of their algorithm.

DGraph: In the first model, Socrates provides clients with access to the DGraph class, which implements the Blueprints API and abstracts away the distributed nature of the underlying graph. Methods of the DGraph class are implemented with parallel calls to the underlying database where possible, but all results are sent back to the client machine and no client code runs on the Socrates cluster. This model of parallelism is suitable for developing analytics that need a global view of the graph or make only simple queries, such as "find all vertices with attribute last_name equal to Smith".

JGraph: In the second model, Socrates allows clients to create processing jobs that are submitted to the cluster and run in parallel on each node; each job is given access to the JGraph local to the node it runs on (see Fig. 2). A JGraph is another implementation of the Blueprints API that represents the partial graph stored on its local machine. Vertex iterators used in parallel jobs on a JGraph iterate only over the vertices local to that machine. However, any question asked about a local vertex, such as a getNeighbors() operation, retrieves all matching results independent of where they are located. The implementation of a parallel job is therefore quite similar to a regular standalone program. The user has a clear view of the boundaries of the local graph and can limit operations on the local graph based on this information. This model of parallelism is suitable for developing analytics that can make use of a wide view of the graph as well as benefit from parallelism, such as sub-graph isomorphism (Fig. 4). It is also useful if the graph can be partitioned into disjoint sub-graphs small enough to fit on one machine; in this case, it is trivial to use Socrates' locality control features to ensure each entire sub-graph is placed on a single machine, allowing any algorithm that runs on a DGraph in the previous model to be parallelized.

Fig. 5: Socrates' insertion speeds for an E-R graph with an average of 10 edges per vertex, with varying cluster sizes.

Fig. 6: Socrates' insertion speeds (on a logarithmic scale) for an E-R graph with 10 million vertices and 100 million edges, with varying cluster sizes. On our cluster, Socrates has exhibited approximately linear ingestion speed-up as nodes are added.

Neighborhood: The final model of parallelism supported by Socrates is intended for algorithms that perform local computation only, such as centrality measurements or PageRank. Socrates provides an interface that allows clients to define a function that will be run in batch on every vertex in the graph. When the function is called, its input is a TinkerGraph (an in-memory implementation of Blueprints) that contains one vertex labeled root, and may contain other elements that the client specifies when the processing job is submitted.
The client is able to specify whether the TinkerGraph should contain the root vertex's immediate neighbors (or only its in- or out-neighbors, in the case of a directed graph) and their edges with the root, as well as any properties of vertices or edges that should be fetched. The client's function is then able to write out new property values for the root vertex or any of its neighboring edges (future implementations will allow new neighboring vertices and edges to be added as well). This model of parallelism is intended to make it easy to write local-computation graph analytics that take full advantage of the computing power of a cluster. Under the hood, Socrates takes care of running the client function in parallel on each node using as many threads as the hardware on that node will support, optimizing SQL queries and inserts, caching frequently used values, and minimizing network communication between nodes.

Communication for parallel processing is provided by the Java Message Service (JMS) using a publish-subscribe method. In order to eliminate centralized communication and potential bottlenecks, each machine operates its own message broker, so every machine in the cluster is on an equal footing. Parallelizing message handling also eliminates a potential single point of failure in the system. Jobs are executed in parallel on each machine and the results are, optionally, returned to the client that submitted the job request.

IV. PERFORMANCE

In this section, we provide preliminary performance benchmarks demonstrating the scalability of Socrates in terms of ingest and parallelized analytics.

A. Cluster Configuration

Our cluster consists of 16 servers, each equipped with quad-core Intel Xeon E GHz processors, 64 GB of 1600 MHz DDR3 RAM, and two 4.0 TB Seagate Constellation HDDs in RAID 0. The servers run CentOS, and Socrates uses MySQL 5.5 with the TokuDB storage engine as its data store.
Fig. 7: Average processing speeds for an iteration of the connected component algorithm on an E-R graph with an average of 10 edges per vertex.

Fig. 8: Average processing speeds (on a logarithmic scale) for an E-R graph with 10 million vertices and 100 million edges, with varying cluster sizes. Again, we see approximately linear processing speed-up as nodes are added.

B. Ingest

To measure our ingest speeds, we insert large randomly generated Erdős–Rényi (E-R) graphs into the Socrates cluster from an external machine. The E-R graphs used here consist of 100-vertex connected components with an average of 1000 edges each; however, the ingest speed of these graphs depends only on the number of vertices and edges, not on the underlying structure of the graph. We insert E-R graphs with a total size varying from 100,000 vertices and one million edges to 100 million vertices and one billion edges. In order to see how well our ingest speed scales as the size of the cluster grows, we repeated the ingest benchmarks using only 2, 4, and 8 nodes in addition to the full 16.

Fig. 5 shows the ingest speeds, expressed as elements inserted per second, for graphs and clusters of varying sizes. We can see that, although the 8- and 16-node clusters do not have time to reach their top speed when ingesting the smallest graph tested (which takes them 18 and 13 seconds, respectively), ingest speeds hold steady even as the input graphs grow to over a billion elements. Part of the reason for this is our use of TokuDB as the storage engine for MySQL. Previous versions of Socrates used the standard InnoDB engine and suffered from severe I/O bottlenecks when the graph grew above roughly 200 million edges, likely because InnoDB could no longer fit the interior nodes of its B-tree in the in-memory buffer pool. TokuDB uses cache-oblivious lookahead arrays in place of B-trees and appears to avoid this problem [12]. Fig. 6 shows the ingest speeds on a logarithmic scale for varying cluster sizes ingesting an E-R graph with 10 million vertices and 100 million edges; we can see that Socrates' ingest speed scales linearly to at least 16 nodes.

C. Parallel Processing

To measure our parallel processing capability, we use a naive connected component algorithm implemented in the Neighborhood parallelism model described above. On its initial iteration, the algorithm assigns each vertex a component attribute equal to the smallest vertex ID among itself and its neighbors. On each subsequent iteration, each vertex examines the component attribute of itself and its neighbors and updates its own component to the smallest value in the examined set. The algorithm terminates when no vertex's component changes during an iteration. Note that this is not necessarily the fastest method for computing connected components; however, it is useful as a benchmark because Socrates must fetch each vertex's in- and out-neighbors along with property information for each of them. Thus, the speed at which an iteration of this algorithm runs gives an idea of the batch processing capability of the Neighborhood parallelism model.

We ran this algorithm on the same graphs that were ingested in the previous experiment, i.e., Erdős–Rényi graphs consisting of connected components with 100 vertices and an average of 1000 edges each, varying in total size from 1.1 million to 1.1 billion elements. Once again, we repeated the experiments using only 2, 4, and 8 nodes in our cluster in addition to the full 16. Figs. 7 and 8 present the results of these experiments. The number of vertices processed per second, averaged over all iterations of the algorithm after the initial iteration, is on the y-axis. Note that processing a single vertex involves fetching its immediate neighborhood (an average of 10 edges and vertices), as well as the component property for each vertex in the neighborhood.
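The label-propagation step just described can be sketched in isolation (an in-memory sketch with an illustrative toy graph; in Socrates the per-vertex step runs as a Neighborhood job against a TinkerGraph of the root and its neighbors):

```python
# Each per-vertex step sees only the root and its immediate neighbors,
# mirroring the Neighborhood model: read the neighbors' component
# labels and write back the minimum.
def cc_iteration(adjacency, component):
    new, changed = {}, False
    for v, nbrs in adjacency.items():
        best = min([component[v]] + [component[n] for n in nbrs])
        new[v] = best
        if best != component[v]:
            changed = True
    return new, changed

# Toy graph with two components, {0, 1, 2} and {3, 4}.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
comp = {v: v for v in adj}   # initial label = own vertex ID
while True:
    comp, changed = cc_iteration(adj, comp)
    if not changed:
        break
print(comp)  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

The number of iterations grows with component diameter, which is why the benchmark reports per-iteration speed rather than end-to-end time.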
These results are qualitatively similar to the ingest results. Fig. 7 demonstrates that, once the graph is large enough to allow the 8- and 16-node clusters to reach full speed, processing speed holds steady up to graphs of over a billion elements. Fig. 8 demonstrates that processing speeds also scale in an approximately linear fashion up to a cluster of 16 nodes.

ACKNOWLEDGMENT

Visualizations in this paper were produced using Pointillist, graph visualization software developed by Jonathan Cohen. We acknowledge D. Silberberg and L. DiStefano for providing feedback and support during the development of Socrates.

REFERENCES
[1] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh, "WTF: The Who to Follow Service at Twitter," in Proceedings of the 22nd International Conference on World Wide Web, Republic and Canton of Geneva, Switzerland, 2013.
[2] U. Kang, C. E. Tsourakakis, and C. Faloutsos, "PEGASUS: A Peta-Scale Graph Mining System: Implementation and Observations," in Ninth IEEE International Conference on Data Mining (ICDM '09), 2009.
[3] P. Kalmegh and S. B. Navathe, "Graph Database Design Challenges Using HPC Platforms," in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, 2012.
[4] D. Dominguez-Sal, P. Urbón-Bayes, A. Giménez-Vañó, S. Gómez-Villamor, N. Martínez-Bazán, and J. L. Larriba-Pey, "Survey of Graph Database Performance on the HPC Scalable Graph Analysis Benchmark," in Proceedings of the 2010 International Conference on Web-age Information Management, Berlin, Heidelberg, 2010.
[5] A. Buluç, A. Fox, J. R. Gilbert, S. A. Kamil, A. Lugowski, L. Oliker, and S. W. Williams, "High-performance Analysis of Filtered Semantic Graphs," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, New York, NY, USA, 2012.
[6] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz, A. McCabe, P. Michaleas, A. Prout, A. Reuther, A. Rosa, and C. Yee, "Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
[7] J. P. Cedeño and K. S. Candan, "R2DF Framework for Ranked Path Queries over Weighted RDF Graphs," in Proceedings of the International Conference on Web Intelligence, Mining and Semantics, New York, NY, USA, 2011, pp. 40:1-40:12.
[8] R. Angles and C. Gutierrez, "Survey of Graph Database Models," ACM Computing Surveys, vol. 40, no. 1, pp. 1:1-1:39, Feb.
[9] J. Hayes and C. Gutierrez, "Bipartite Graphs as Intermediate Model for RDF," in The Semantic Web: ISWC 2004, S. A. McIlraith, D. Plexousakis, and F. van Harmelen, Eds. Springer Berlin Heidelberg, 2004.
[10] S. Jouili and V. Vansteenberghe, "An Empirical Comparison of Graph Databases," in 2013 International Conference on Social Computing (SocialCom), 2013.
[11] N. Bronson, Z. Amsden, G. Cabrera III, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani, "TAO: Facebook's Distributed Data Store for the Social Graph," in USENIX Annual Technical Conference (ATC).
[12] M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson, "Cache-oblivious Streaming B-trees," in Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, New York, NY, USA, 2007.
[13] E. Cho, S. A. Myers, and J. Leskovec, "Friendship and Mobility: User Movement in Location-Based Social Networks," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[14] L. P. Cordella et al., "A (Sub)graph Isomorphism Algorithm for Matching Large Graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Implementing Graph Pattern Mining for Big Data in the Cloud
Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya [email protected]
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
STINGER: High Performance Data Structure for Streaming Graphs
STINGER: High Performance Data Structure for Streaming Graphs David Ediger Rob McColl Jason Riedy David A. Bader Georgia Institute of Technology Atlanta, GA, USA Abstract The current research focus on
Evaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist [email protected], [email protected], [email protected] Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?
HPC2012 Workshop Cetraro, Italy Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy? Bill Blake CTO Cray, Inc. The Big Data Challenge Supercomputing minimizes data
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
How to Choose Between Hadoop, NoSQL and RDBMS
How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A
Graph Analytics in Big Data. John Feo Pacific Northwest National Laboratory
Graph Analytics in Big Data John Feo Pacific Northwest National Laboratory 1 A changing World The breadth of problems requiring graph analytics is growing rapidly Large Network Systems Social Networks
X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released
General announcements In-Memory is available next month http://www.oracle.com/us/corporate/events/dbim/index.html X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released
SCALABLE DATA SERVICES
1 SCALABLE DATA SERVICES 2110414 Large Scale Computing Systems Natawut Nupairoj, Ph.D. Outline 2 Overview MySQL Database Clustering GlusterFS Memcached 3 Overview Problems of Data Services 4 Data retrieval
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures In-memory Semantic Database Formulation
Distributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
Can the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Graph Database Performance: An Oracle Perspective
Graph Database Performance: An Oracle Perspective Xavier Lopez, Ph.D. Senior Director, Product Management 1 Copyright 2012, Oracle and/or its affiliates. All rights reserved. Program Agenda Broad Perspective
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Performance Modeling and Analysis of a Database Server with Write-Heavy Workload
Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Manfred Dellkrantz, Maria Kihl 2, and Anders Robertsson Department of Automatic Control, Lund University 2 Department of
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
Cloud Computing Where ISR Data Will Go for Exploitation
Cloud Computing Where ISR Data Will Go for Exploitation 22 September 2009 Albert Reuther, Jeremy Kepner, Peter Michaleas, William Smith This work is sponsored by the Department of the Air Force under Air
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Indexing Big Data. Michael A. Bender. ingest data ?????? ??? oy vey
Indexing Big Data 30,000 Foot View Big data of Databases problem Michael A. Bender ingest data oy vey 365 42????????? organize data on disks query your data Indexing Big Data 30,000 Foot View Big data
NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre
NoSQL systems: introduction and data models Riccardo Torlone Università Roma Tre Why NoSQL? In the last thirty years relational databases have been the default choice for serious data storage. An architect
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection Sotirios Beis, Symeon Papadopoulos, and Yiannis Kompatsiaris Information Technologies Institute, CERTH, 57001, Thermi, Greece {sotbeis,papadop,ikom}@iti.gr
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
PERFORMANCE MODELS FOR APACHE ACCUMULO:
Securely explore your data PERFORMANCE MODELS FOR APACHE ACCUMULO: THE HEAVY TAIL OF A SHAREDNOTHING ARCHITECTURE Chris McCubbin Director of Data Science Sqrrl Data, Inc. I M NOT ADAM FUCHS But perhaps
Presto/Blockus: Towards Scalable R Data Analysis
/Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew
A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud.
A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud. Tejas Bharat Thorat Prof.RanjanaR.Badre Computer Engineering Department Computer
A Framework of User-Driven Data Analytics in the Cloud for Course Management
A Framework of User-Driven Data Analytics in the Cloud for Course Management Jie ZHANG 1, William Chandra TJHI 2, Bu Sung LEE 1, Kee Khoon LEE 2, Julita VASSILEVA 3 & Chee Kit LOOI 4 1 School of Computer
SQL Server 2012 Performance White Paper
Published: April 2012 Applies to: SQL Server 2012 Copyright The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
HIGH PERFORMANCE BIG DATA ANALYTICS
HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning
A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique
A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique Jyoti Malhotra 1,Priya Ghyare 2 Associate Professor, Dept. of Information Technology, MIT College of
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief
Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden
Distributed R for Big Data
Distributed R for Big Data Indrajit Roy, HP Labs November 2013 Team: Shivara m Erik Kyungyon g Alvin Rob Vanish A Big Data story Once upon a time, a customer in distress had. 2+ billion rows of financial
Cloud Computing and Advanced Relationship Analytics
Cloud Computing and Advanced Relationship Analytics Using Objectivity/DB to Discover the Relationships in your Data By Brian Clark Vice President, Product Management Objectivity, Inc. 408 992 7136 [email protected]
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Processing Large Amounts of Images on Hadoop with OpenCV
Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
Review: Graph Databases
Volume 4, Issue 5, May 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Pallavi Madan M.Tech,Dept
FAWN - a Fast Array of Wimpy Nodes
University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed
Load Balancing on a Grid Using Data Characteristics
Load Balancing on a Grid Using Data Characteristics Jonathan White and Dale R. Thompson Computer Science and Computer Engineering Department University of Arkansas Fayetteville, AR 72701, USA {jlw09, drt}@uark.edu
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Oracle Big Data Spatial & Graph Social Network Analysis - Case Study
Oracle Big Data Spatial & Graph Social Network Analysis - Case Study Mark Rittman, CTO, Rittman Mead OTN EMEA Tour, May 2016 [email protected] www.rittmanmead.com @rittmanmead About the Speaker Mark
A Survey Study on Monitoring Service for Grid
A Survey Study on Monitoring Service for Grid Erkang You [email protected] ABSTRACT Grid is a distributed system that integrates heterogeneous systems into a single transparent computer, aiming to provide
Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1
Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Cray: Enabling Real-Time Discovery in Big Data
Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects
Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro
Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,
Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks
WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance
Map-Based Graph Analysis on MapReduce
2013 IEEE International Conference on Big Data Map-Based Graph Analysis on MapReduce Upa Gupta, Leonidas Fegaras University of Texas at Arlington, CSE Arlington, TX 76019 {upa.gupta,fegaras}@uta.edu Abstract
