Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012




Graph - Everywhere
1: Friendship Graph
2: Food Graph
3: Internet Graph
Most relationships can be abstracted as graphs.

Graph Computing - Everywhere
+ Graph Algorithm: Max Flow (Min Cut).
+ Web Page Integration: PageRank.
+ Social Network Application: Friendship Mining. Search the results from your social network.

Graph - Bottleneck
EX: Bing's friendship search. To learn what our friends' friends' friends think (a 3-hop neighborhood), the edges we would traverse are:
130 + 130^2 + 130^3 = 2.2M
An ORM can traverse about 1,000 relationships per second.
Statistics: Huge!!! How to store and how to compute?

Type            Nodes      Edges      Size
US Road Graph   2.4*10^7   6.0*10^7   788MB
Web Graph       2.0*10^10  1.6*10^11  1494GB
Facebook Graph  8*10^8     1.0*10^11  787GB
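The arithmetic on this slide can be checked with a short sketch. The fan-out of 130 and the rate of 1,000 traversals per second are the slide's figures; the function and constant names are made up for illustration, and the sum is an upper bound that ignores friends shared between hops.

```python
def edges_within_hops(fan_out: int, hops: int) -> int:
    """Upper bound on edges traversed when expanding `hops` levels from one node."""
    return sum(fan_out ** h for h in range(1, hops + 1))


FAN_OUT = 130              # per-hop fan-out used on the slide
TRAVERSALS_PER_SEC = 1000  # ORM traversal rate quoted on the slide

edges = edges_within_hops(FAN_OUT, 3)
seconds = edges / TRAVERSALS_PER_SEC

print(edges)                # 2214030, i.e. about 2.2M
print(round(seconds / 60))  # 37 (minutes for a single 3-hop query)
```

At that rate a single 3-hop query would take roughly 37 minutes, which is why a plain ORM is described as the bottleneck.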

Graph Datastore
A graph datastore is a (NoSQL) database that uses graph structures with nodes, edges, and properties to represent and store data. Its data layout, indexes, and query mechanisms are highly optimized for graphs. These datastores target online query processing, where low latency is the core requirement (e.g. responding to a web request).
EX: HyperGraphDB, Neo4j, FlockDB, Trinity.

Graph Computing System
A graph computing system emphasizes the computation model and framework used to solve large-scale graph algorithms. These systems target offline analytics, where high throughput is the goal (e.g. graph mining).
EX: Pregel, MapReduce, PEGASUS, Trinity.

Graph Datastore - Trinity
Trinity is a memory-based distributed database and computation platform that supports online query processing and offline analytics on graphs.
+ Cell-based data model.
+ Global memory addressing.
+ High performance.
- Low scalability.

Graph Datastore - FlockDB
FlockDB is a distributed graph database for storing adjacency lists. It is open source, built upon MySQL, and comes from Twitter.
+ Partitioned by user_id.
+ Edges stored in both directions, indexed by (src, dest).
+ Optimized query mechanism. (Written in Scala)

Forward
src_id  dest_id  other
20      12
20      13
20      16
20      18

Backward
dest_id  src_id  other
12       20
12       36
12       40
12       42
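The idea of storing every edge in both directions can be sketched in a few lines. This is a toy in-memory stand-in, not FlockDB's API or schema: the class and method names are hypothetical, and only the edge values are taken from the Forward and Backward tables on the slide.

```python
from collections import defaultdict


class EdgeStore:
    """Toy store that writes each edge to a forward and a backward index."""

    def __init__(self):
        self.forward = defaultdict(set)   # src_id  -> {dest_id, ...}
        self.backward = defaultdict(set)  # dest_id -> {src_id, ...}

    def add_edge(self, src, dest):
        # One logical edge, two physical rows: this doubles the writes
        # but lets both directions be answered by a single key lookup.
        self.forward[src].add(dest)
        self.backward[dest].add(src)

    def out_neighbors(self, node):
        return sorted(self.forward.get(node, ()))

    def in_neighbors(self, node):
        return sorted(self.backward.get(node, ()))


store = EdgeStore()
for dest in (12, 13, 16, 18):   # rows of the Forward table
    store.add_edge(20, dest)
for src in (36, 40, 42):        # remaining rows of the Backward table
    store.add_edge(src, 12)

print(store.out_neighbors(20))  # [12, 13, 16, 18]
print(store.in_neighbors(12))   # [20, 36, 40, 42]
```

The trade-off is extra storage and write cost in exchange for answering "who points at node 12?" without scanning every adjacency list.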

Graph Datastore - Others
HyperGraphDB
HyperGraphDB is a (hyper)graph database designed mostly for knowledge representation, AI, and semantic web projects. It can also be used as an embedded object-oriented database for Java projects of all sizes.
Neo4j
Neo4j stores data in the nodes and relationships of a graph.
+ Disk-based, with a powerful traversal framework for high-speed traversals in the node space.
+ Provides APIs at the programming-language level (double weight()).
- Not so good in terms of scalability.

Graph Computing System - Vertex-based
A computation task is expressed as multiple iterative super-steps, and each vertex acts as an independent agent. The vertex-based computation model is a special case of the BSP (Bulk Synchronous Parallel) model.
Disadvantage:
- Memory limitation.
- Network overhead.
- Superlinear complexity.
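A minimal single-process sketch of the super-step idea, assuming a simple task (propagating the maximum vertex value through the graph). It is not the API of Pregel or any other system; the function name and the dictionary-based message boxes are illustrative only, and every edge target is assumed to appear in `values`.

```python
def propagate_max(values, edges):
    """Run synchronous super-steps until no vertex sends a message.

    values: {vertex: initial value}
    edges:  {vertex: [out-neighbours]}
    """
    values = dict(values)

    # Super-step 0: every vertex sends its value to its out-neighbours.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)

    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, msgs in inbox.items():
            # Each vertex only sees its own state and its incoming messages.
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
            # Otherwise the vertex stays silent (it "votes to halt").
        # Barrier: messages sent now are delivered in the next super-step.
        inbox = outbox
    return values


chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(propagate_max({"A": 3, "B": 6, "C": 2, "D": 1}, chain))
# {'A': 6, 'B': 6, 'C': 6, 'D': 6}
```

In a distributed setting the inbox and outbox become network messages between machines, which is where the memory and network overheads listed above come from.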

Graph Computing System - MR-based
Uses the MapReduce computation framework to obtain scalability and simple programming. PEGASUS discovered an important primitive for some graph algorithms (matrix-vector multiplication). Linear complexity.
Disadvantage:
- Graph algorithms must be totally rethought.
- High IO overhead (no global data structure).
- Superlinear complexity.
- EX: BC on Daytona
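The matrix-vector primitive can be sketched as one map phase and one reduce phase. This is an in-process illustration, not PEGASUS code: the function names are hypothetical, and the sketch assumes the whole vector is visible to every mapper, whereas a real MapReduce job would first have to join matrix entries with vector entries.

```python
from collections import defaultdict


def map_phase(matrix_entries, vector):
    """Emit (row, m_ij * v_j) for every non-zero entry (i, j, m_ij)."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * vector[j]


def reduce_phase(pairs, size):
    """Sum the partial products that share the same row key."""
    sums = defaultdict(float)
    for i, partial in pairs:
        sums[i] += partial
    return [sums[i] for i in range(size)]


def matvec(matrix_entries, vector):
    return reduce_phase(map_phase(matrix_entries, vector), len(vector))


# Directed cycle 0 -> 1 -> 2 -> 0, stored as (row, column, weight)
# where an edge j -> i becomes the entry (i, j, 1.0).
cycle = [(1, 0, 1.0), (2, 1, 1.0), (0, 2, 1.0)]
print(matvec(cycle, [1.0, 0.0, 0.0]))  # [0.0, 1.0, 0.0]
```

Repeating this multiplication moves the value one step around the cycle each time; iterative algorithms such as PageRank repeat the same step with a different matrix until the vector stops changing. Each repetition is a separate job that rereads its input, which matches the high IO overhead listed above.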

Challenge - Locality
When traversing the graph, where do we access the next node?
- Network communication with another machine?
- Random read on the disk?
Solution in the graph datastore:
+ Distributed in-memory architecture.
+ Index or inverted index for nodes.
+ Partition for nodes.

Challenge - Partition
How do we partition a graph, especially a dynamic graph such as a social network?
Potential solution:
+ Partition by centrality.
+ Replication.
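One way to see why partitioning is hard is to count how many edges cross machines under a naive placement. The sketch below assumes placement by vertex id modulo the number of partitions, which is an illustrative baseline and not one of the solutions proposed on the slide; all names are hypothetical.

```python
def hash_partition(vertex: int, num_parts: int) -> int:
    """Naive placement: vertex id modulo the number of partitions."""
    return vertex % num_parts


def edge_cut(edges, num_parts: int) -> int:
    """Count edges whose endpoints land on different partitions."""
    return sum(
        1
        for u, v in edges
        if hash_partition(u, num_parts) != hash_partition(v, num_parts)
    )


# A 4-cycle plus one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(edge_cut(edges, 2))  # 4 of the 5 edges cross partitions
```

Every cut edge is a potential network hop during traversal, so a partitioner that ignores graph structure can turn most traversals into remote calls.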

Challenge - Network && IO Overhead
Vertex-based approach
+ Machine-to-machine message passing.
+ Bipartition the graph.
MapReduce-based approach
+ Partition the graph to enhance locality.
+ Graph datastore upon the DFS.

Future Work
Disk-based graph computation model and approach.
+ Layout mechanism. -> Graph datastore.
+ Computation mechanism. -> Vertex-based, MR-based.
Some systems like Hama and Giraph:
+ Built upon Hadoop and HDFS.
+ Adopt the Pregel model.

Thanks -- Stay hungry, stay foolish.