Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Date: 06-08-2012

1 Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group)

2 Graph - Everywhere
1: Friendship Graph
2: Food Graph
3: Internet Graph
Most relationships can be abstracted as graphs.

3 Graph Computing - Everywhere
Graph Algorithm: Max Flow (Min Cut).
Web Page Integration: PageRank.
Social Network Application: Friendship Mining. Search the results from your social network.

4 Graph - Bottleneck
EX: Just like Bing's friendship search. If we want to know the ideas of our friends' friends' friends (a 3-hop neighborhood), the edges we would like to traverse are: 130 + 130^2 + 130^3 = 2.2M, while an ORM can traverse 1,000 relationships in 1 second.
Statistics: Huge!!! How to store and how to compute?
Type            Nodes       Edges       Size
US Road Graph   2.4*10^7    6.0*10^7    788MB
Web Graph       2.0*10^10   1.6*10^     GB
Facebook Graph  8*10^8      1.0*10^11   787GB
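
The arithmetic on this slide can be checked with a short sketch. It assumes every user has exactly 130 friends (the figure the slide's formula uses) and that no friends are shared between neighborhoods. The names `edges_within_hops` and `traversal_seconds` are illustrative and do not come from the slides.

```python
def edges_within_hops(fanout: int, hops: int) -> int:
    """Count the edges traversed to reach every neighbor up to `hops` away,
    assuming a uniform fan-out and no shared friends."""
    return sum(fanout ** k for k in range(1, hops + 1))


def traversal_seconds(edges: int, rate_per_second: int) -> float:
    """Time needed to traverse `edges` at a fixed traversal rate."""
    return edges / rate_per_second


edges = edges_within_hops(130, 3)
print(edges)                           # 2214030, i.e. roughly 2.2M
print(traversal_seconds(edges, 1000))  # 2214.03 seconds
```

At the quoted rate of 1,000 relationships per second, the 3-hop query would take about 37 minutes, which is far too slow for an online request.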

5 Graph Datastore
A graph datastore is a (NoSQL) database that uses graph structures with nodes, edges, and properties to represent and store data. Its data layout, indexes, and query mechanisms are highly optimized for graphs. These datastores focus on online query processing, where low latency is the core requirement (e.g. responding to a web request).
EX: HyperGraphDB, Neo4j, FlockDB, Trinity.

6 Graph Computing System
A graph computing system puts more emphasis on the computation model and framework for solving large-scale graph algorithms. These systems focus on offline analytics and aim at high throughput (e.g. graph mining).
EX: Pregel, MapReduce, PEGASUS, Trinity.

7 Graph Datastore - Trinity
Trinity is a memory-based distributed database and computation platform that supports online query processing and offline analytics on graphs.
+ Cell-based data model.
+ Global memory addressing.
+ High performance.
- Low scalability.

8 Graph Datastore - FlockDB
FlockDB is a distributed graph database for storing adjacency lists. It is open source, built on MySQL, and developed at Twitter.
+ Partitioned by user_id.
+ Edges stored in both directions, indexed by (src, dest).
+ Optimized query mechanism. (Written in Scala)
Forward table columns: src_id, dest_id, other
Backward table columns: dest_id, src_id, other
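
The idea of storing each edge in both directions can be sketched as follows. This is not FlockDB's API: the `EdgeStore` class and its method names are hypothetical, and plain in-memory dictionaries stand in for the forward and backward MySQL tables.

```python
from collections import defaultdict


class EdgeStore:
    """Toy adjacency store that records every edge twice (hypothetical)."""

    def __init__(self):
        self.forward = defaultdict(set)   # src_id -> {dest_id, ...}
        self.backward = defaultdict(set)  # dest_id -> {src_id, ...}

    def add_edge(self, src_id, dest_id):
        # One logical edge, written to both indexes.
        self.forward[src_id].add(dest_id)
        self.backward[dest_id].add(src_id)

    def following(self, user_id):
        """Outgoing neighbors, answered from the forward index."""
        return sorted(self.forward.get(user_id, ()))

    def followers(self, user_id):
        """Incoming neighbors, answered from the backward index."""
        return sorted(self.backward.get(user_id, ()))


store = EdgeStore()
store.add_edge(1, 2)
store.add_edge(3, 2)
print(store.following(1))  # [2]
print(store.followers(2))  # [1, 3]
```

Writing each edge twice doubles the storage, but both "who do I follow" and "who follows me" become single-key lookups instead of full scans.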

9 Graph Datastore - Others
HyperGraphDB: a (hyper)graph database designed mostly for knowledge representation, AI, and semantic web projects. It can also be used as an embedded object-oriented database for Java projects of all sizes.
Neo4j: stores data in the nodes and relationships of a graph. Disk-based, with a powerful traversal framework for high-speed traversal of the node space. Provides APIs at the programming-language level (double weight()). Not so good in terms of scalability.

10 Graph Computing System - Vertex-based
A computation task is expressed as multiple iterative super-steps, and each vertex acts as an independent agent. The vertex-based computation model is a special case of the BSP (bulk synchronous parallel) model.
Disadvantages:
- Memory limitation.
- Network overhead.
- Superlinear complexity.
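
A minimal single-process sketch of the super-step idea, using the common "propagate the maximum value" example. It is an illustration only, not Pregel's API: `run_supersteps` is a hypothetical name, message passing is simulated with dictionaries, and every neighbor is assumed to appear as a key in `graph`.

```python
def run_supersteps(graph, values):
    """Propagate the maximum value through the graph in BSP-style supersteps.

    graph:  dict mapping each vertex to a list of its out-neighbors.
    values: dict mapping each vertex to its initial integer value.
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to all of its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])
    supersteps = 1

    # Keep going while any message is still in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: the vertex stays halted
            best = max(messages)
            if best > values[v]:
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
        inbox = outbox  # synchronization barrier between supersteps
        supersteps += 1

    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final, steps = run_supersteps(graph, {"a": 3, "b": 6, "c": 2})
print(final)  # {'a': 6, 'b': 6, 'c': 6}
```

Each vertex only reads its own inbox and writes to its neighbors' outboxes, which is why the per-vertex state and the message traffic become the memory and network costs listed above.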

11 Graph Computing System - MR-based
Uses the MapReduce computation framework to obtain scalability and a simple programming model. PEGASUS identifies an important primitive for some graph algorithms (matrix-vector multiplication), with linear complexity.
Disadvantages:
- Graph algorithms must be totally rethought.
- High IO overhead (no global data structure).
- Superlinear complexity.
- EX: BC on Daytona
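
The matrix-vector multiplication primitive can be sketched as a map phase and a reduce phase. This is plain Python standing in for a MapReduce job, not PEGASUS or Hadoop code. The function names and the `(row, col, weight)` edge format are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(edges, vector):
    """Emit (row, partial product) for each nonzero matrix entry."""
    for row, col, weight in edges:
        yield row, weight * vector[col]


def reduce_phase(pairs):
    """Sum all partial products that share the same row key."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)


def matvec(edges, vector):
    """One matrix-vector multiplication expressed as map then reduce."""
    return reduce_phase(map_phase(edges, vector))


# Sparse matrix given as (row, col, weight) triples, i.e. an edge list.
edges = [(0, 1, 1.0), (1, 0, 0.5), (1, 2, 2.0)]
vector = {0: 2.0, 1: 3.0, 2: 4.0}
print(matvec(edges, vector))  # {0: 3.0, 1: 9.0}
```

Rows without any nonzero entry do not appear in the result. Iterating `matvec` with a suitably normalized edge list is the core loop of algorithms such as PageRank, which is why a single primitive covers several graph algorithms.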

12 Challenge - Locality
When traversing the graph, where is the next node accessed?
- Network communication with another machine?
- Random read on the disk?
Solutions in the graph datastore:
+ Distributed in-memory architecture.
+ Index or inverted index for nodes.
+ Partitioning of nodes.

13 Challenge - Partition
How to partition a graph, especially a dynamic graph such as a social network?
Potential solutions:
+ Partition by centrality.
+ Replication.
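
To make the cost of a partitioning choice concrete, the sketch below counts how many edges cross machine boundaries under a naive modulo partitioning. Modulo placement is an assumption chosen for simplicity; it is not one of the solutions proposed on the slide, and the function names are hypothetical.

```python
def hash_partition(node_id: int, num_machines: int) -> int:
    """Assign a node to a machine by taking its id modulo the machine count."""
    return node_id % num_machines


def cut_edges(edges, num_machines: int) -> int:
    """Count edges whose two endpoints are placed on different machines."""
    return sum(
        1
        for src, dst in edges
        if hash_partition(src, num_machines) != hash_partition(dst, num_machines)
    )


edges = [(0, 1), (0, 2), (1, 3), (2, 4)]
print(cut_edges(edges, 2))  # 1: only edge (0, 1) crosses machines
```

Every cut edge turns a local traversal step into network communication, so a better partitioning (or replication of frequently accessed nodes) is one that lowers this count.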

14 Challenge - Network && IO Overhead
Vertex-based approach
+ Machine-to-machine message passing.
+ Bipartition the graph.
MapReduce-based approach
+ Partition the graph to enhance locality.
+ Graph datastore on top of the DFS.

15 Future Work
Disk-based graph computation model and approach.
+ Layout mechanism. -> Graph datastore.
+ Computation mechanism. -> Vertex-based, MR-based.
Some systems like Hama, Giraph:
+ Built upon Hadoop and HDFS.
+ Adopt the Pregel model.

16 Thanks -- Stay hungry, stay foolish.