A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics
Ilie Tanase 1, Yinglong Xia 1, Lifeng Nai 2, Yanbin Liu 1, Wei Tan 1, Jason Crawford 1, and Ching-Yung Lin 1

1 IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA
2 Georgia Institute of Technology, Atlanta, GA 30332, USA
{igtanase,yxia,ygliu,wtan,ccjason,chingyung}@us.ibm.com, [email protected]

ABSTRACT
Graph analytics on big data is currently a very active area of research in both industry and academia. To support graph analytics efficiently, a large number of graph processing systems have emerged, targeting various aspects of a graph application such as in-memory and on-disk representations, persistent storage, database capability, and runtimes and execution models for exploiting parallelism. In this paper we discuss a novel graph processing system called System G Native Store, which allows for efficient graph data organization and processing on modern computing architectures. In particular, we describe a runtime designed to exploit multiple levels of parallelism and a generic infrastructure that allows users to express graphs with various in-memory and persistent storage properties. We experimentally show the efficiency of System G Native Store for processing graph queries on state-of-the-art platforms.

1. INTRODUCTION
Large scale graph analytics is a very active area of research in both academia and industry. Researchers have found high impact applications in a wide variety of Big Data domains, ranging from social media analysis, recommendation systems, and insider threat detection, to medical diagnosis and protein interactions. These applications often handle a vast collection of entities with various relationships, which are naturally represented by the graph data structure. A graph G(V,E) consists of a set of vertices V and a set of edges E representing relations between vertices. The design of a graph processing system (GPS) includes two major components: the graph data structures and the programming model. In this paper we discuss the design decisions we employed in System G Native Store to create a flexible graph library that can be tuned and customized by users based on their application-specific needs. For the graph data structure, some of the major decisions to be made are:

- Is the graph stored in memory, on disk, or both?
- Is the graph directed or undirected?
- Is the graph sparse or dense?
- Are there properties associated with vertices and edges?
- Are ACID properties supported?

Depending on the particular problem to be solved, the answers to these questions vary, and often the more functionality that is required, the more complicated the design becomes. Once the data structure is decided upon, the next big decision is selecting the programming model exposed to users.
In addition to the data structure API [3], the programming model may include support for graph algorithms, for expressing parallelism on shared memory machines or large clusters, and for fault tolerance. As part of the programming model, users are provided with an interface that can vary from a full runtime system [9][2][8] to a specialized query language [11][1]. In the rest of this paper we answer the questions raised in this section in the context of a novel graph processing solution called System G Native Store. In Section 2 we discuss related work, Section 3 provides more details of System G, Section 4 discusses our graph data structure design decisions, and Section 5 presents the runtime system. We evaluate the performance of our GPS in Section 6 and conclude in Section 7.

2. BACKGROUND AND RELATED WORK
In this section we discuss related projects and classify them into three broad categories: graph data structure libraries, graph processing frameworks, where the emphasis is on the programming models, and graph databases, where the focus is on storage.

Graph libraries: Graph libraries for in-memory-only processing have been available for a long time. For example, the BOOST Graph Library (BGL) [12] provides a generic graph library where users can customize multiple aspects of the data structure, including directedness, in-memory storage, and vertex and edge properties. This flexibility gives users great power in customizing the data structure for their particular needs. The Parallel BOOST Graph Library [6], STAPL [7], and Galois [10] provide in-memory parallel graph data structures. All these projects provide generic algorithms to access all vertices and edges, possibly in parallel, without knowledge of the underlying in-memory storage implementation. System G Native Store employs a design philosophy similar to these libraries but extends them with support for persistent storage and a flexible runtime for better work
scheduling.

Graph processing frameworks: Pregel and Giraph [9, 2] employ a parallel programming model called Bulk Synchronous Parallel (BSP), where the computation consists of a sequence of iterations. In each iteration, the framework invokes a user-defined function for each vertex in parallel. This function usually reads messages sent to the vertex in the last iteration, sends messages to other vertices that will be processed at the next iteration, and modifies the state of the vertex and its outgoing edges. GraphLab [8] is a parallel programming and computation framework targeted at sparse data and iterative graph algorithms. Pregel/Giraph and GraphLab are good at processing sparse data with local dependencies using iterative algorithms. However, they are not designed to answer ad hoc queries or to process graphs with rich properties. TinkerPop [3] is an open-source graph ecosystem consisting of key interfaces and tools needed in the graph processing space, including the property graph model (Blueprints), data flow (Pipes), graph traversal and manipulation (Gremlin), graph-object mapping (Frames), graph algorithms (Furnace), and a graph server (Rexster). Interfaces defined by TinkerPop are becoming popular in the graph community. As an example, Titan [4] adheres to many of the APIs defined by TinkerPop and uses data stores such as HBase and Cassandra as the scale-out persistent layer. TinkerPop focuses on defining data exchange formats, protocols, and APIs rather than offering software with good performance.

Graph stores: Neo4j [11] provides a disk-based, pointer-chasing graph storage model. It stores graph vertices and edges in a denormalized, fixed-length structure and uses pointer chasing instead of index-based methods to visit them. By this means, it avoids index accesses and provides better graph traversal performance than disk-based RDBMS implementations.

3. SYSTEM G GRAPH DATABASE OVERVIEW
Comprehensive graph processing solution: System G Graph Database has been actively developed by IBM, where G stands for graph. It provides a whole-spectrum graph solution, shown in Figure 1, covering graph visualization, analytics, runtimes, and storage. Although System G Graph Database includes a version based on HBase/HDFS [5] and an interface using DB2/DB2RDF or TinkerPop-compatible DBs, in this paper we focus on a novel version called System G Native Store, which is independent of any existing product or open source software.

High performance graph runtime: System G Graph Database provides a generic C++ sequential runtime library, a concurrent runtime library for multithreaded computing platforms, and a distributed runtime library for computer clusters. For the distributed library, it offers an efficient communication layer implemented with the IBM Parallel Active Messaging Interface (PAMI) and Remote Direct Memory Access (RDMA). All libraries are highly optimized for state-of-the-art computer architectures such as POWER multicores. Additional graph computing accelerators based on FPGA and GPU are reserved for exploring highly parallel implementations of some graph computing primitives.

Native Graph Storage: Of the several storage options supported by System G Graph Database shown in Figure 1, we focus on the Native Store for graphs.
This store persists a graph by saving its structure and its properties, on both vertices and edges, into a set of files.

Figure 1: System G Graph Database provides a whole-spectrum solution for graph processing (visualization and analytics middleware, shared and distributed memory graph libraries with a PAMI/RDMA communication layer, FPGA/GPU accelerators, and storage back ends including Native Store, HBase/HDFS, DB2 RDF, and TinkerPop-compliant DBs).

Since it is a file-based store implemented in C++, it is not only highly efficient but also very portable across file systems and storage hardware. The graph structure and properties are stored separately, and the properties can be further divided into subgroups. Thus, it avoids loading unnecessary data for analytics. The Native Store also supports multiple versioning at the vertex level, which is the basis for the bi-temporal feature and transaction management in the database.

4. GENERIC GRAPH DATA STRUCTURE
A graph consists of a collection of vertices and edges. In this section we discuss the functionality our graph library supports as we answer the questions raised in the introduction. Our in-memory graph representation uses an adjacency list to store vertices and edges. Thus, the graph stores a set of vertices, and each vertex individually maintains the list of its neighboring edges, i.e., its adjacency. We have chosen the adjacency list over representations like Compressed Sparse Row (CSR), edge lists, or the adjacency matrix because of its flexibility in accommodating dynamic graphs, where vertices and edges are added and removed while traversals are supported simultaneously. Similar to BGL or STAPL, our framework employs a generic C++ design using templates to allow users to customize a particular graph data structure. The core graph data structure models a single-property graph. In this model the graph stores one property for each vertex and one property for each edge, as shown in Figure 2, Lines 1-5. Users can define different graph classes by appropriately customizing any of the available template arguments. The first template argument is the vertex property; the second template argument is the edge property; the third argument specifies whether the graph is directed, undirected, or directed with predecessors; the last template argument allows for fine, low-level customizations related to the graph storage, such as the particular storage data structure for vertices and edges. We selected a generic programming design as it provides our users the flexibility of polymorphism without the runtime overhead of virtual inheritance. Independent of the template arguments used to instantiate it, the graph provides an interface to add and delete vertices and edges and to access the data. The interface is similar to the TinkerPop Blueprints API.
 1 template <class VertexProperty,
 2           class EdgeProperty,
 3           class DIRECTEDNESS,
 4           class Traits>
 5 class Graph {
 6   typedef ... vertex_descriptor;
 7   typedef ... edge_descriptor;
 8   typedef ... vertex_iterator;
 9   typedef ... edge_iterator;
10   vertex_descriptor add_vertex(VertexProperty&);
11   edge_descriptor add_edge(vertex_descriptor v1,
12                            vertex_descriptor v2,
13                            EdgeProperty&);
14   vertex_iterator find_vertex(vertex_descriptor);
15 };
16
17 typedef Graph<int, double, DIRECTED> graph1_type;
18
19 class my_vertex_property {...};
20 typedef Graph<my_vertex_property, double,
21               UNDIRECTED> user_graph2_type;
22
23 template <class G>
24 void process_vertex(G& g, vertex_descriptor vid) {
25   vertex_iterator vit = g.find_vertex(vid);
26   // process vertex property
27   // vit->property();
28   edge_iterator eit = vit->edges_begin();
29   for (; eit != vit->edges_end(); ++eit) {
30     // process edge identified by eit
31     // eit->target(); eit->property();
32   }
33 }

Figure 2: System G Native Store Graph Interface

The most important methods of the SG graph class are add_vertex, add_edge, and find_vertex. add_vertex, shown in Figure 2, Line 10, creates a new vertex in the graph and returns the vertex identifier associated with it. The vertex identifier can be used to access and modify the vertex, such as adding edges to the vertex, obtaining an iterator (pointer) to the vertex to inspect its properties or its adjacency list, etc. The method add_edge, shown in Figure 2, Line 11, creates a new edge in the graph between two vertices and with a given initial property. Line 14 shows find_vertex, which returns an iterator pointing to the vertex data structure. The interface includes additional methods that cannot all be described here due to lack of space.

4.1 Vertex processing
In Figure 2, lines 23-33, we show a simple example of a vertex-based computation. First, a vertex is always identified by a unique vertex identifier, and the first step before querying the properties of a vertex is to map from the vertex identifier to a vertex reference (vertex iterator). Having access to a vertex iterator, one can access the vertex properties (Figure 2, line 27) and information about the adjacency list, as shown in Figure 2, lines 28 and 29. Traversing all outgoing edges of a vertex can be accomplished with the loop in Figure 2, lines 29-32, and using an edge iterator one can access information about a particular edge, such as source, target, and edge property. All other graph computations in our framework can be expressed as compositions of the pattern included in this example.

4.2 Native Store graph classes hierarchy
The generic graph data structure introduced in the previous section provides enough functionality for a large category of in-memory graph algorithms and analytics. In this section we introduce two more important extensions of our core graph data structure: the multiproperty graph and the graph with persistent storage, as shown in Figure 3.

Graph<VertexProperty, EdgeProperty, Directedness, Traits>
MultiPropertyGraph<StorageType>
inDiskGraph<VertexProperty, EdgeProperty, Directedness, Traits>

Figure 3: Native Store class hierarchy. The base graph class is in memory only. The inDiskGraph derives from it and adds storage capability. The multiproperty graph can derive from either the base graph or the inDiskGraph, as requested by the user.
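Independent of which class in this hierarchy is instantiated, the core interface of Figure 2 stays the same. As a concrete illustration, the hedged sketch below creates and traverses a single-property graph. It follows the type and method names of Figure 2, but the include path and the main() driver are assumptions made for illustration, not code taken from the library.

#include <iostream>
// Hypothetical header name; the actual include path of System G Native Store
// is not given in the paper.
#include "sg_graph.hpp"

// A directed graph with an int vertex property and a double edge property,
// mirroring graph1_type from Figure 2, line 17.
typedef Graph<int, double, DIRECTED> graph1_type;

int main() {
  graph1_type g;

  // add_vertex (Figure 2, line 10) takes a vertex property by reference and
  // returns a vertex descriptor identifying the new vertex.
  int vp0 = 7, vp1 = 42;
  graph1_type::vertex_descriptor v0 = g.add_vertex(vp0);
  graph1_type::vertex_descriptor v1 = g.add_vertex(vp1);

  // add_edge (Figure 2, line 11) connects two existing vertices and attaches
  // an initial edge property.
  double ep = 1.5;
  g.add_edge(v0, v1, ep);

  // find_vertex (Figure 2, line 14) maps the descriptor back to an iterator,
  // from which properties and the adjacency list are reachable.
  graph1_type::vertex_iterator vit = g.find_vertex(v0);
  std::cout << "vertex property: " << vit->property() << "\n";
  for (graph1_type::edge_iterator eit = vit->edges_begin();
       eit != vit->edges_end(); ++eit) {
    // Assumes the target descriptor is an integral, printable id.
    std::cout << "edge to " << eit->target()
              << " with property " << eit->property() << "\n";
  }
  return 0;
}

The local query in process_vertex (Figure 2, lines 23-33) follows exactly the same pattern, starting from a descriptor supplied by the caller instead of one returned by add_vertex.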
4.3 Multiproperty graph
A common graph used for graph analytics is the multiproperty graph, where each vertex and edge can have an arbitrary number of properties. This functionality is provided by our graph framework by instantiating the base graph class with a multiproperty class for both vertices and edges. A multiproperty class is essentially a map data structure allowing dynamic insertion and deletion of an arbitrary number of key, value pairs. In Figure 4 we show a depiction of a multiproperty graph.

Vertex: Label, Set<PropertyName: PropertyValue>
Edge:   Label, Set<PropertyName: PropertyValue>

Figure 4: Multiproperty graph

Each vertex has an associated vertex typeid, which is a mechanism for grouping vertices based on similarity. For example, in a social network graph where we have people, forums, and posts as vertices, one can associate a typeid with each group of vertices and later perform analytics over vertices of a particular typeid. In addition to the typeid, each vertex can store an arbitrary number of key, value pairs that we will refer to as property names and property values. Similarly, for edges we associate an edge label and an arbitrary number of property, value pairs. Additionally, the multiproperty graph is directed, but it also keeps track of the predecessors of each vertex; thus one can iterate efficiently over both the outgoing and the incoming edges of a particular vertex. In addition to the multiproperty support already mentioned, a property graph provides an additional interface that allows traversal of vertices and edges of a particular typeid or label, and additional support for indexing. For example, if every vertex has a property called LastName, one may want to search for a vertex where LastName has a particular value. The multiproperty graph allows users to specify an index on a particular property name, and subsequent operations on the graph will maintain the index and use it for faster lookups.
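The paper does not list the multiproperty class itself, so the sketch below is only a minimal illustration of the idea, assuming std::map-based storage and method names (set_property, get_property) that are not necessarily those of the library.

#include <map>
#include <string>

// Minimal sketch of a multiproperty class: a dynamic map of
// property-name -> property-value pairs plus a type/label id.
// Names and types here are assumptions for illustration only.
class multi_property {
public:
  void set_label(const std::string& l) { label_ = l; }   // vertex typeid or edge label
  const std::string& label() const { return label_; }

  void set_property(const std::string& name, const std::string& value) {
    properties_[name] = value;                            // insert or overwrite
  }
  bool get_property(const std::string& name, std::string& value) const {
    std::map<std::string, std::string>::const_iterator it = properties_.find(name);
    if (it == properties_.end()) return false;
    value = it->second;
    return true;
  }
  void erase_property(const std::string& name) { properties_.erase(name); }

private:
  std::string label_;                                 // groups vertices/edges by similarity
  std::map<std::string, std::string> properties_;     // arbitrary key/value pairs
};

// Such a class could then be plugged into the generic graph of Figure 2, e.g.:
// typedef Graph<multi_property, multi_property, DIRECTED> multi_property_graph;

An index on a property name such as LastName can then be kept as an auxiliary map from property value to vertex descriptors, which the graph updates on every insertion and deletion of that property.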
Figure 5: Property graph persistency in System G Native Store (version management on top of a graph structure store, split into in-edges and out-edges, and a separate graph property store).

4.4 Persistent graph
System G provides persistent storage for property graphs as a natural extension of the in-memory graph data structure. Compared to existing graph databases, this graph storage is distinguished by its analytics-amenable design. The basic idea of the storage is shown in Figure 5 and can be described as follows:

Separating structure and property: System G stores the graph structure and the properties separately. The graph structure breaks down into the adjacent incoming edges and outgoing edges of each vertex. Such separation avoids loading graph properties for analytics defined merely on graph topology, e.g., breadth first search (BFS) and graph betweenness. This not only reduces disk I/O consumption but also critically improves memory utilization and reduces the partition count for large shared-memory graphs.

Efficient graph loading: System G allows loading a graph on demand, where we start with an empty graph in memory and incrementally load the vertices and edges as needed. This feature allows us to start graph analytics quickly and also to handle graphs larger than memory. System G uses the internal vertex IDs and edge IDs as offsets to store the graph data on disk. Regardless of the size of the graph, it locates the vertex data immediately without incurring any scan or search. In addition, it is worth noting that it stores the edges adjacent to a vertex contiguously, unlike some existing systems, e.g., Neo4j. Therefore, System G achieves improved data locality.

Versioning support: System G persistent graphs associate time stamps with each vertex. This allows transactions and rollback in graph operations, which is critical in many real applications.

Portability: The System G persistent graph utilizes a set of files to store graph data and can therefore be hosted on virtually any file system and random access hardware, such as a regular Linux file system or HDFS, on HDD or SSD.

4.5 Concurrent and Distributed Graph
To fully exploit the large number of threads available on most modern machines, we provide a multi-threaded graph (MTG) data structure within our framework. We take a compositional approach for this: an MTG is composed of an arbitrary number of nearly independent subgraphs. This approach allows us to maximize coarse-grain parallelism. However, we also enable each individual thread to access any of the subgraphs, and this may lead to unsafe concurrent access from multiple threads. The MTG data structure provides atomicity for all of its methods, and in the current implementation only one thread can operate on a subgraph at a given point in time. For example, building an MTG using multiple threads is efficient because each thread can add and access vertices in independent subgraphs, leading to no conflicts. If the application leads to situations where two threads access the same vertex or edge, the access will be serialized using locking. Similar to the MTG, we provide a distributed memory graph (DISTG). A DISTG is likewise composed of an arbitrary number of subgraphs, with one or more subgraphs on each machine. The runtime of our library provides remote procedure call (RPC) as the main modality for accessing remote data or performing certain computations on remote data and eventually returning results. The distributed memory graph is currently a work in progress that will be deployed in our library relatively soon.
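As an illustration of this compositional design, the hedged sketch below partitions vertices across subgraphs and serializes conflicting accesses with one mutex per subgraph; the class and member names are assumptions, not the library's actual API, and the real MTG may hash or lock differently.

#include <cstddef>
#include <mutex>
#include <vector>

// Sketch of a multi-threaded graph (MTG) composed of nearly independent
// subgraphs. Each subgraph is protected by its own mutex, so threads that
// work on different subgraphs proceed without conflicts, while accesses to
// the same subgraph are serialized. SubGraph is assumed default-constructible.
template <class SubGraph>
class multi_threaded_graph {
public:
  explicit multi_threaded_graph(std::size_t num_subgraphs)
      : subgraphs_(num_subgraphs), locks_(num_subgraphs) {}

  // Map a vertex id to its owning subgraph, e.g. by simple modulo hashing.
  std::size_t owner(std::size_t vertex_id) const {
    return vertex_id % subgraphs_.size();
  }

  // Apply an operation to the subgraph owning vertex_id under its lock.
  template <class Op>
  void with_subgraph(std::size_t vertex_id, Op op) {
    std::size_t s = owner(vertex_id);
    std::lock_guard<std::mutex> guard(locks_[s]);  // serialize conflicting access
    op(subgraphs_[s]);
  }

private:
  std::vector<SubGraph> subgraphs_;
  std::vector<std::mutex> locks_;
};

Populating the graph in parallel is then efficient when each thread mostly touches a distinct subgraph; when two threads do hit the same subgraph, the per-subgraph lock serializes them, matching the behavior described above.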
4.6 Java interoperability
Native Store supports Java clients through an in-process JNI layer that maps the concepts and methods of the multiproperty C++ graph to Java static methods. This has allowed the System G team to leverage existing open source Java-based code bases to implement additional graph features. As a result, System G users have the ability to access their graphs through Groovy, Gremlin, and SPARQL. In most cases Java clients do not program against the JNI interface but instead against a TinkerPop Blueprints layer built upon the JNI methods.

5. PROGRAMMING MODEL
In Section 4 we gave an overview of the main interfaces of the graph data structures available in System G Native Store. In this section we present the programming model exposed to the users in more detail.

5.1 Single thread programming
The graph exposed by our library is a two-dimensional data structure consisting of a set of vertices and a set of edges. We employ an adjacency list, and the consequence is that we do not store the edges as one contiguous set; instead, each vertex stores its outgoing edges. Thus, traversing the whole structure of a graph is often a composition of two nested loops: an outer one over each vertex and a nested one over each edge of a vertex. This is exemplified in Figure 6. If the user requires some analysis over vertices and edges of particular labels, the label needs to be provided as an argument to the begin and end methods. If the user performs a local query starting at a particular vertex identified by its vertex descriptor, the code shown in Figure 2, lines 23-33, can be used.

 1 template <class G>
 2 void process_graph(G& g) {
 3   vertex_iterator vit = g.vertices_begin();
 4   for (; vit != g.vertices_end(); ++vit) {
 5     edge_iterator eit = vit->edges_begin();
 6     for (; eit != vit->edges_end(); ++eit) {
 7       // process edge identified by eit
 8       // eit->target(); eit->property(); ...
 9     }
10   }
11 }

Figure 6: Native Store whole graph traversal

5.2 Support for concurrent execution
Most parallel machines available today employ multicores, and the System G Native Store library provides the necessary support to exploit multiple threads. The Native Store runtime is the component of our framework that exposes a task-based model of computation to developers, isolating them from notions like POSIX threads, cores, SMT, etc. A parallel computation in Native Store consists of a set of tasks with possible dependencies between them. The set of tasks for a particular computation is in general decided
when the computation is started (a static computation). However, there are situations where tasks can be created dynamically by other tasks. This situation is accommodated by our framework, with the only complication being the internal mechanism that tracks the number of tasks created and completed in order to let the user know when the computation is finished. Tasks created after the start of the computation are also managed by the runtime scheduler, which employs work stealing to balance the computation between threads. Work stealing is important for performance in such a system because the amount of work per vertex is variable, as it often depends on how many successors and predecessors the vertex has. On top of the task-based execution model, we implemented in System G Native Store a set of primitives that abstract the parallelism away from the user.

 1 template <class Graph>
 2 class addverts_wf : public ibmppl::work_function {
 3   virtual void execute(task_id) {
 4   }
 5 };
 6
 7 // Case 1: populate graph with vertices
 8 addverts_wf<graph> av(g);
 9 ibmppl::execute_tasks(&av, 2 * num_threads);
10
11 // Case 2: perform a computation on each vertex
12 ibmppl::for_each(g, process_vertex);
13
14 // Case 3: perform a computation for each vertex of
15 //         the task graph
16 ibmppl::task_graph tg;
17 // ... add vertices (tasks) and edges (dependencies) to tg
18 ibmppl::schedule_task_graph(tg);

Figure 7: Expressing parallelism in System G Native Store

In Figure 7, Lines 1-9, we show a simple example of how users can specify a work function, which is the body of a task, and how they can schedule a number of tasks that is a multiple of the number of cores on the machine. The code in this figure can be used, for example, to populate the graph with vertices and edges. If the graph already has data, the number of tasks created can vary from the number of cores to the number of vertices in the graph, depending on the granularity of the computation per vertex. With execute_tasks() we do not specify a partition of the data per task; we leave that to the user. A second primitive we provide as part of our runtime is for_each(), shown in Figure 7, Line 12. In this case the user requests that the work function process_vertex, depicted in Figure 2, be invoked on each vertex of the graph. The framework underneath decides how to create tasks to maximize the throughput of vertices processed per second. A last primitive supported by the Native Store runtime is schedule_task_graph(), exemplified in Figure 7, Lines 14-18. In this case the user provides as input a task graph, which is a directed acyclic graph. Each vertex represents a task, identified by its unique task identifier and its associated work function. The edges represent dependencies that will be enforced when processing tasks.
The execution will first set the number of incoming dependencies for each vertex and will start executing the vertices/tasks with no dependencies. When a task completes, it decrements a count on each of its successor vertices (tasks). Vertices whose count reaches zero are then scheduled for execution. Work stealing in this case happens only across ready-to-run tasks. With the above support we anticipate that users will be able to exploit parallelism decoupled from lower-level details of the machine, allowing the runtime to perform the mapping to the machine. The concepts introduced in this section are not new; they are currently employed in other libraries like Intel TBB, STAPL [7], and Galois [10]. The novel part that we focus on in System G Native Store is to employ these patterns to generalize some of the existing programming models like Pregel [9] and Giraph [2], which mainly target fully parallel vertex-centric computations, disregarding, for example, the fact that one may need to process vertices in a particular order.
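The task-graph execution described above is essentially topological scheduling of a DAG. The sketch below illustrates the counting mechanism in plain C++; it is a simplified, sequential illustration with made-up names, not the Native Store scheduler itself, which runs the ready tasks on worker threads balanced by work stealing.

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Illustrative task-graph scheduler: tasks are vertices of a DAG, edges are
// dependencies. A task becomes ready when all of its predecessors finished.
// This sequential sketch omits threads and work stealing for clarity.
struct task {
  std::function<void()> work;          // the task's work function
  std::vector<std::size_t> successors; // indices of dependent tasks
  std::size_t unfinished_deps = 0;     // number of incoming dependencies
};

void schedule_task_graph(std::vector<task>& tasks) {
  // 1. Set the number of incoming dependencies for each vertex.
  for (const task& t : tasks)
    for (std::size_t s : t.successors)
      ++tasks[s].unfinished_deps;

  // 2. Start with the tasks that have no dependencies.
  std::queue<std::size_t> ready;
  for (std::size_t i = 0; i < tasks.size(); ++i)
    if (tasks[i].unfinished_deps == 0) ready.push(i);

  // 3. Execute ready tasks; on completion, decrement the counts of the
  //    successors and enqueue any successor whose count drops to zero.
  while (!ready.empty()) {
    std::size_t i = ready.front();
    ready.pop();
    tasks[i].work();
    for (std::size_t s : tasks[i].successors)
      if (--tasks[s].unfinished_deps == 0) ready.push(s);
  }
}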
6. PERFORMANCE EVALUATION
In this section we evaluate the performance of our graph library and runtime system using benchmarks that import a large corpus of data into the graph data structure and subsequently perform a set of queries on the graph. For our experimental evaluation we use a social network dataset generated with the Linked Data Benchmark Council (LDBC) graph generator. The dataset is generated a priori and stored in CSV files on disk. The vertices and edges considered for the experiments in this section are described in Figure 8 (trailing digits of the sizes and the Properties and Load Time values were lost in transcription).

Vertices                  Size          Properties   Load Time (s)
Person                    100,...       ...          ...
Post                      54,784,...    ...          ...
Forum                     3,676,...     ...          ...
Place                     5,...         ...          ...
Tag                       12,...        ...          ...

Edges                     Size
Person-knows-person       2,887,...
Person-likes-post         208,241,...
Post-hasCreator-person    54,784,...
Post-hasTag-tag           42,797,...
Forum-contains-post       54,784,...

Figure 8: Input dataset used for evaluation

6.1 Graph construction
We imported the input dataset described in Figure 8 using an Intel Haswell server with two processors, each with 12 cores running at 2.7 GHz, and 256 GB of memory. The input CSV files are stored on an SSD. The graph is built using one thread, in memory only, and is the multiproperty graph described in Section 4.3. We include in Figure 8, in the fourth column, the time it took to read the individual input files and add the corresponding vertices and edges in memory. In addition to the corresponding add vertex and add edge methods, the execution time includes reading from file, parsing the input, and adding the related properties. It also includes the time to add entries into an index that maps the external vertex identifier specified in the input files to the internal vertex identifier used by our graph library. In general, the ingestion time depends on the number of properties considered. For example, when adding the post vertices we achieve a rate of 290,000 vertices per second.

6.2 Graph queries
In this section we evaluate three different queries performed on the dataset we considered. The LDBC project provides a set of seven sample queries, and in this work we evaluated query numbers two, four, and six (see footnote 1), denoted by Query 1, 2, and 3 in this paper, respectively. Query 1 finds the newest 20 posts among one person's friends; Query 2 finds the top 10 most popular topics/tags (by the number of comments and posts) that one's friends have been talking about in the last x hours; and Query 3 finds the 10 most popular tags used by people among one's friends and friends-of-friends, appearing in posts where tag x is also mentioned. The queries are implemented exactly as specified on the LDBC site, retrieving all the required fields. Essentially, all queries are localized searches through the graph data structure, starting from a particular vertex and subject to various filtering conditions. One can notice the importance of having labels for edges and, more importantly, of having an efficient storage layout that allows traversing only the edges and vertices of a particular label.

1: socialnet bm neo4j/wiki/queries

In Figure 9 we illustrate the distribution of 1000 query times (Query 1), where the starting vertex was chosen randomly. The bars depict how many queries had the average execution time shown on the X axis, while the line and the right Y axis show the average number of edges traversed per query for a particular bucket.

Figure 9: Distribution of Query 1 times

From the figure we first notice that for this type of graph most queries traverse a relatively low number of edges (1000 for the first bar). However, for a small number of queries the number of edges traversed is close to 70k. Secondly, as expected, the execution time increases with the number of edges traversed per query, but for all queries considered the time was under 25 microseconds. The throughput of the queries is shown in Figure 10, where the metric on the y-axis is the number of traversed edges per second (TEPS). The figure clearly shows that System G Native Store scales well for various numbers of threads and for all the queries, and the throughput is excellent, reaching 160 million edges per second.

Figure 10: Throughput of graph queries

7. CONCLUSION AND FUTURE WORK
This paper presents the high performance graph library in IBM System G Native Store, a whole-spectrum solution for graph processing. This runtime library is not only easy to use due to its intuitive APIs, but also highly efficient for various graph analytics. It preserves graph data locality and supports multiproperty graphs, persistent graphs, and concurrent and distributed graphs. Preliminary experiments show highly efficient graph processing performance on a social network dataset with 500 million edges.

8. REFERENCES
[1] SPARQL.
[2] Apache Giraph.
[3] TinkerPop.
[4] Titan distributed graph database.
[5] M. Canim and Y. Chang. System G data store: Big, rich graph data analytics in the cloud. In IEEE International Conference on Cloud Engineering.
[6] D. Gregor and A. Lumsdaine. The Parallel BGL: A generic library for distributed graph computations. In Parallel Object-Oriented Scientific Computing.
[7] Harshvardhan, A. Fidel, N. Amato, and L. Rauchwerger. The STAPL parallel graph library. In Languages and Compilers for Parallel Computing.
[8] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. arXiv preprint.
[9] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD. ACM.
[10] K. Pingali. High-speed graph analytics with the Galois system. In Proceedings of the First Workshop on Parallel Programming for Analytics Applications, PPAA '14, pages 41-42.
[11] I. Robinson, J. Webber, and E. Eifrem. Graph Databases. O'Reilly Media.
[12] J. Siek, A. Lumsdaine, and L.-Q. Lee. The Boost Graph Library.