Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012
Graph - Everywhere
1: Friendship Graph
2: Food Graph
3: Internet Graph
Most relationships can be abstracted as graphs.
Graph Computing Everywhere
Graph Algorithm: Max Flow (Min Cut).
Web Page Integration: PageRank.
Social Network Application: Friendship Mining. Search the results from your social network.
Graph - Bottleneck
EX: Bing's friendship search. If we want to know the ideas of our friends' friends' friends (a 3-hop neighborhood), the edges we would need to traverse are: 130 + 130^2 + 130^3 = 2.2M. An ORM can traverse about 1,000 relationships per second.
Statistics: Huge!!! How to store and how to compute?

Type            Nodes       Edges       Size
US Road Graph   2.4*10^7    6.0*10^7    788MB
Web Graph       2.0*10^10   1.6*10^11   1494GB
Facebook Graph  8*10^8      1.0*10^11   787GB
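The 3-hop arithmetic above can be checked with a short sketch. It assumes that 130 is the average number of friends per user and that every hop fans out to 130 new edges (no shared friends), so the result is an upper bound. The function name is illustrative only.

```python
def edges_within_hops(avg_degree: int, hops: int) -> int:
    """Upper bound on edges traversed: d + d^2 + ... + d^hops."""
    return sum(avg_degree ** h for h in range(1, hops + 1))


# Assumption: 130 friends per user, as in the slide's formula.
edges = edges_within_hops(130, 3)

# Assumption: the ORM rate quoted above, 1,000 relationships per second.
seconds = edges / 1000
minutes = round(seconds / 60, 1)

print(edges)    # 2214030, i.e. about 2.2M
print(minutes)  # 36.9
```

At the quoted ORM rate, a single 3-hop query would take roughly 37 minutes, which is far too slow for an online request.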
Graph Datastore
Basically, a graph datastore is a database (NoSQL DB) that uses graph structures with nodes, edges, and properties to represent and store data, and that is highly optimized in its data layout, indexes, and query mechanisms. These datastores focus on online query processing, where low latency is the core requirement (e.g. responding to a web request).
EX: HyperGraphDB, Neo4j, FlockDB, Trinity.
Graph Computing System
A graph computing system puts more emphasis on the computation model and framework for solving large-scale graph algorithms. These systems focus on offline analytics, where the aim is high throughput (e.g. graph mining).
EX: Pregel, MapReduce, PEGASUS, Trinity.
Graph Datastore - Trinity
Trinity is a memory-based distributed database and computation platform that supports online query processing and offline analytics on graphs.
+ Cell-based data model.
+ Global memory addressing.
+ High performance.
- Low scalability.
Graph Datastore - FlockDB
FlockDB is a distributed graph database for storing adjacency lists. It is open source, built on MySQL, at Twitter.
+ Partitioned by user_id.
+ Edges stored in both directions, indexed by (src, dest).
+ Optimized query mechanism. (Written in Scala)

Forward
src_id  dest_id  other
20      12
20      13
20      16
20      18

Backward
dest_id  src_id  other
12       20
12       36
12       40
12       42
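The idea of storing each edge in both directions can be sketched in memory as follows. This is not FlockDB's API or its MySQL schema. The `EdgeStore` class and its method names are hypothetical, and the edge values come from the forward and backward tables above.

```python
from collections import defaultdict


class EdgeStore:
    """Hypothetical in-memory stand-in for a two-direction edge index."""

    def __init__(self):
        self.forward = defaultdict(set)   # src_id -> {dest_id, ...}
        self.backward = defaultdict(set)  # dest_id -> {src_id, ...}

    def add_edge(self, src_id: int, dest_id: int) -> None:
        # Each edge is written twice so both lookups are a single index read.
        self.forward[src_id].add(dest_id)
        self.backward[dest_id].add(src_id)

    def out_neighbors(self, src_id: int) -> list:
        return sorted(self.forward.get(src_id, ()))

    def in_neighbors(self, dest_id: int) -> list:
        return sorted(self.backward.get(dest_id, ()))


store = EdgeStore()
for src, dest in [(20, 12), (20, 13), (20, 16), (20, 18),
                  (36, 12), (40, 12), (42, 12)]:
    store.add_edge(src, dest)

print(store.out_neighbors(20))  # [12, 13, 16, 18]
print(store.in_neighbors(12))   # [20, 36, 40, 42]
```

The cost of the second copy is extra write work and storage. The benefit is that "who does 20 point to?" and "who points to 12?" are both answered without scanning.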
Graph Datastore - Others
HyperGraphDB: a (hyper)graph database designed mostly for knowledge representation, AI, and semantic web projects. It can also be used as an embedded object-oriented database for Java projects of all sizes.
Neo4j: stores data in the nodes and relationships of a graph. Disk-based, with a powerful traversal framework for high-speed traversals in the node space. Provides APIs at the programming-language level (double weight()). Not so good in terms of scalability.
Graph Computing System - Vertex-based
A computation task is expressed as multiple iterative super-steps, and each vertex acts as an independent agent. The vertex-based computation model is a special case of the BSP (Bulk Synchronous Parallel) model.
Disadvantages:
- Memory limitation.
- Network overhead.
- Superlinear complexity.
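A minimal single-process sketch of the super-step idea, assuming a toy task where every vertex should learn the largest value in the graph. It is not Pregel's API: `run_supersteps` and the message dictionaries are hypothetical names, and a real system would distribute vertices across machines.

```python
def run_supersteps(graph, values):
    """graph: vertex -> list of out-neighbors; values: vertex -> initial value."""
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])
    supersteps = 1

    # Later supersteps: a vertex does work only if it received messages.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
            # Otherwise the vertex stays silent (it "votes to halt").
        inbox = outbox  # Barrier: messages become visible in the next superstep.
        supersteps += 1

    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = run_supersteps(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final_values)  # every vertex ends with 6
```

Each vertex reads only its own value and its inbox, which is what makes the model easy to parallelize. The per-vertex state and the message traffic are also where the memory and network costs listed above come from.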
Graph Computing System - MR-based
Uses the MapReduce computation framework to obtain scalability and simple programming. PEGASUS discovers an important primitive for some graph algorithms (matrix-vector multiplication). Linear complexity.
Disadvantages:
- Graph algorithms must be rethought entirely.
- High IO overhead (no global data structure).
- Superlinear complexity.
- EX: BC on Daytona
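The matrix-vector multiplication primitive can be sketched as one map phase and one reduce phase. This is a single-process illustration, not PEGASUS code or a Hadoop job. The function names and the small example matrix are assumptions made for the sketch.

```python
from collections import defaultdict


def map_phase(entries, vector):
    """For each matrix entry (row, col, weight), emit (row, weight * vector[col])."""
    for row, col, weight in entries:
        yield row, weight * vector[col]


def reduce_phase(pairs):
    """Sum all partial products that share the same row key."""
    result = defaultdict(float)
    for row, partial in pairs:
        result[row] += partial
    return dict(result)


# Sparse adjacency matrix as (row, col, weight) triples (illustrative data).
entries = [(0, 1, 1.0), (0, 2, 0.5), (1, 0, 1.0), (2, 0, 1.0), (2, 1, 0.5)]
vector = {0: 1.0, 1: 2.0, 2: 4.0}

product = reduce_phase(map_phase(entries, vector))
print(product)  # {0: 4.0, 1: 1.0, 2: 2.0}
```

Iterating this step, with the output vector fed back as the next input, is the shape that algorithms such as PageRank take in this model. In a real MapReduce job each iteration reads and writes the distributed file system, which is the IO overhead noted above.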
Challenge - Locality
When traversing the graph, where do we access the next node?
- Network communication with another machine?
- Random read on the disk?
Solutions in graph datastores:
+ Distributed in-memory architecture.
+ Index or inverted index for nodes.
+ Partitioning of nodes.
Challenge - Partition
How to partition a graph, especially a dynamic graph such as a social network?
[Figure: labels A and B]
Potential solutions:
+ Partition nodes by their centrality.
+ Replication.
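One way to see why partitioning is hard is to count how many edges cross partition boundaries under a naive scheme. The sketch below assumes simple modulo hashing on the node id, which is a baseline chosen for illustration and not one of the potential solutions listed above. The function names and edge list are hypothetical.

```python
def hash_partition(node_id: int, num_partitions: int) -> int:
    """Assign a node to a partition by modulo hashing (illustrative baseline)."""
    return node_id % num_partitions


def cut_edges(edges, num_partitions: int) -> int:
    """Count edges whose endpoints land on different partitions."""
    return sum(
        1
        for src, dest in edges
        if hash_partition(src, num_partitions) != hash_partition(dest, num_partitions)
    )


edges = [(20, 12), (20, 13), (20, 16), (20, 18), (36, 12), (40, 12), (42, 12)]
crossing = cut_edges(edges, 4)
print(crossing, "of", len(edges), "edges cross partitions")  # 3 of 7
```

Every crossing edge is a potential network round trip during traversal, so a good partitioning scheme tries to keep this count low while still balancing load.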
Challenge - Network && IO Overhead
Vertex-based approach
+ Machine-to-machine message passing.
+ Bipartition the graph.
MapReduce-based approach
+ Partition the graph to enhance locality.
+ Graph datastore on top of the DFS.
Future Work
Disk-based graph computation model and approach.
+ Layout mechanism. -> Graph datastore.
+ Computation mechanism. -> Vertex-based. MR-based.
Some systems like Hama, Giraph.
+ Built upon Hadoop and HDFS.
+ Adopt the Pregel model.
Thanks -- Stay hungry, stay foolish.