A Performance Evaluation of Open Source Graph Databases
Robert McColl, David Ediger, Jason Poovey, Dan Campbell, David A. Bader




Overview
- Motivation
- Options
- Evaluation
- Results
- Lessons Learned
- Moving Forward

Massive Streaming Semantic Graphs
- Features
  - Millions to billions of vertices and edges with rich semantic information (name, type, weight, time), possibly missing or inconsistent data
  - Thousands to millions of updates per second
  - Power-law degree distribution, sparse (d(v) ~ O(1)), low diameter
- Financial
  - NYSE processes >2TB daily, maintains 10s of PB
- Social
  - 50,000+ Facebook Likes per second, 1.2B users
  - 6,000 Twitter Tweets per second, 500M users
- Google
  - Several dozen 1PB data sets
  - Knowledge Graph: 570M entities, 18B relationships
- Business
  - eBay: >17 trillion records, 5B new records per day

Given these graphs
- How do we store them?
  - Memory vs. disk
  - ACID vs. loose consistency
  - Single node vs. distributed
  - Semantic and temporal information
  - Numeric IDs
- How do we query them?
  - Simple neighbor queries
  - Traversal methods, filtering
  - Programming models vs. query languages
- How do we compute meaningful answers?
  - Algorithms and implementation paradigms
  - BSP, MapReduce, MPI, OpenMP, single threaded
  - C, C++, Python, Java, Scala, Javascript, SQL, Gremlin

The Options
* see page 2 of the paper
* DEX is now Sparksee

Evaluation
- Four fundamental graph kernels
  - Prefer a particular implementation
  - Emphasize common programming / traversal styles
- Measure performance in multiple ways
  - Time to completion
  - Memory in use (MemoryMXBean for Java, OS reporting otherwise)
  - Qualitative analysis of capabilities and development experience
- If not provided by package, written by naïve programmer
- Same R-MAT graphs used with all packages
  - Vertices: 1K (tiny), 32K (small), 1M (medium), and 16M (large)
  - Edges: 8K (tiny), 256K (small), 8M (medium), 128M (large)
  - Graphs of 1B edges considered, but not included due to the number of systems that couldn't work with large
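For readers unfamiliar with R-MAT, the sketch below shows the general idea: each edge is placed by recursively choosing one of four quadrants of the adjacency matrix, which yields a skewed, power-law-like degree distribution. The function name rmat_edges and the quadrant probabilities (a, b, c) are illustrative assumptions; the slides do not state which parameters or generator were used.

```python
import random

def rmat_edges(scale, num_edges, a=0.55, b=0.1, c=0.1, seed=0):
    """Return num_edges directed edges over 2**scale vertices.

    a, b, c are the probabilities of the top-left, top-right and
    bottom-left quadrants; the bottom-right quadrant gets the rest.
    These values are assumed for illustration only.
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        # One quadrant choice per bit of the vertex ID.
        for _ in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass            # top-left: both bits stay 0
            elif r < a + b:
                dst |= 1        # top-right
            elif r < a + b + c:
                src |= 1        # bottom-left
            else:
                src |= 1        # bottom-right
                dst |= 1
        edges.append((src, dst))
    return edges
```

With scale=10 and num_edges=8192 this produces a graph of the same nominal size as the "tiny" input (1K vertices, 8K edges).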

Evaluation
- Single Source Shortest Paths (SSSP)
  - Must be a breadth-first traversal of the graph
  - Level-synchronous parallel where possible
  - Output unweighted distances from source to all vertices
- Application
  - Used for routing and connectivity
  - Building block for other algorithms (e.g. betweenness centrality)
  - Graph 500 benchmark
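A minimal serial sketch of a level-synchronous breadth-first traversal is given below. It is not the benchmark code from the repository; the function name bfs_levels and the dictionary-of-lists adjacency format are assumptions. In a parallel version, the loop over the current frontier is the part that would be distributed across threads.

```python
def bfs_levels(adj, source):
    """Return {vertex: unweighted distance from source} for reachable vertices.

    adj maps each vertex to an iterable of its out-neighbors.
    """
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # All vertices in the frontier are at the same distance, so this
        # loop is the natural unit of parallel work per level.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return dist
```

Unreachable vertices are simply absent from the result.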

Evaluation
- Connected Components
  - Based on Shiloach-Vishkin
  - Edge-parallel label-pushing style
  - Global graph metric that touches all edges in the graph
- Properties
  - Not theoretically work efficient, but edge-parallel pattern is representative of a broad class of graph algorithms
  - Read-heavy with sparse writes
  - Cache friendly graph traversal with cache-unfriendly random label read / assignment
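The edge-parallel label-pushing pattern can be sketched as follows. This is a simplified serial label propagation, not the full Shiloach-Vishkin algorithm (it omits the hooking and pointer-jumping steps), and the function name components_label_push is an assumption. Each pass streams over the edge list (cache friendly) while reading and writing labels at arbitrary positions (cache unfriendly), which matches the access pattern described above.

```python
def components_label_push(num_vertices, edges):
    """Return a list where label[v] is the smallest vertex ID in v's component.

    edges is a list of (u, v) pairs treated as undirected.
    """
    label = list(range(num_vertices))
    changed = True
    while changed:
        changed = False
        # Sweep every edge; push the smaller label to the other endpoint.
        for u, v in edges:
            if label[u] < label[v]:
                label[v] = label[u]
                changed = True
            elif label[v] < label[u]:
                label[u] = label[v]
                changed = True
    return label
```

The number of passes depends on edge order and graph diameter, which is one reason the approach is not work efficient in theory.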

Evaluation
- PageRank centrality algorithm
  - Vertex-parallel Bulk Synchronous Parallel (BSP) power-iteration style
  - Measure of influence over information flow and importance in the network
  - Representative of a broad class of iterative graph algorithms
- Update benchmark
  - Perform edge insertions and deletions in parallel
  - Random access and modification of the structure
  - Real-world networks are in constant motion as new edges and vertices enter the graph
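As an illustration of the power-iteration style, the sketch below computes PageRank with a fixed number of synchronous iterations: every new rank is computed only from the previous iteration's ranks, as in a BSP superstep. The function name pagerank, the damping factor of 0.85, the iteration count, and the uniform redistribution of rank from vertices with no out-edges are all assumptions; the slides do not give these details.

```python
def pagerank(num_vertices, edges, damping=0.85, iterations=50):
    """Return a list of PageRank scores that sums to 1.

    edges is a list of directed (u, v) pairs.
    """
    n = num_vertices
    out_degree = [0] * n
    for u, _ in edges:
        out_degree[u] += 1

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by vertices without out-edges is spread uniformly.
        dangling = sum(rank[v] for v in range(n) if out_degree[v] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = [base] * n
        # Each vertex shares its previous rank equally among its out-edges.
        for u, v in edges:
            new_rank[v] += damping * rank[u] / out_degree[u]
        rank = new_rank
    return rank
```

A convergence test on the change between iterations could replace the fixed iteration count.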

DISCLAIMERS
- We are not expert Java programmers, Hadoop tuners/magicians, Python gurus (maybe a few), Scala wizards
- We are some of the primary maintainers of the STINGER graph package
- The code for these implementations is available at github.com/robmccoll/graphdb-testing
  - Test code is BSD, package licenses vary
- We apologize if you feel that your particular work has been slighted or misrepresented
- If any of these packages belong to you, we invite and encourage you to improve on our results and submit them back to the repository

Maximum Size Completed

Package    Tiny  Small  Medium  Large
boost      X     X      X       X
dex        X     X
mtgl       X     X      X       X
neo4j      X     X
networkx   X     X      X
orientdb   X     X      X
sqlite     X     X
stinger    X     X      X       X
titan      X     X      X
bagel      X     X      X
pegasus    X     X      X       X
giraph     X     X      X       X

Determined by crashing, running out of memory, or running an algorithm for longer than 24 hours.
Exception: Dex had a 1 million element (|V| + |E|) limit.

Memory Usage
- 3 orders of magnitude difference in memory usage

Page Rank Performance
[Figure: time (s) per package, log scale from 0.01 to 100,000, for the small, medium, and large graphs]
- Small: range 10,000x
- Medium (x32): range 10,000x; 40~90x slower, except Pegasus, Giraph
- Large (x1024): range 100x; ~20x slower, except Pegasus

SSSP and Components
[Figure: small vs. medium]

Observations on Quantitative
- Big performance gaps, possible groups
  - High Performance (ms response)
  - Medium Performance (sub minute response)
  - Storage Performance
    - Perhaps suited for storage, but not analysis
  - Not language or system, just not written with graph algorithms in mind
- Some parallel, but many not
  - Graph libraries generally parallel
  - Not much parallelism in (single node) disk-backed DBs
  - Parallelism only increases random read and write
  - Bottleneck is disk bandwidth and latency
- Data scaling is hard to predict
- Full results available at:
  - github.com/robmccoll/graphdb-testing
  - arxiv.org/abs/1309.2675

Qualitative Observation
- Multitude of input file formats supported
  - No universal support
  - XML, JSON, CSV, or other proprietary binary and plain-text formats
  - Trade-off between size, descriptiveness, and flexibility
  - Challenge for data exchange
  - Edge-based, delimited, and self-describing formats are easily parsed in parallel and translated to other formats
- Same data, same algorithms, same HW; the DB / library is the difference
- No consensus on ACID / Transactions
  - No clear result on its effect on speed
  - Not clear that the applications exist
- No consensus on query languages, models

Moving Forward
- To advance HPC outside of our community, software must be accessible
- Some applications don't compile out of the box
  - Self-contained packages are easiest
  - Package management systems don't work well on nodes inside a cluster
  - Build problems drive users away
- Lack of consistent approaches to documentation
  - Well-written tutorials better than sparse full API doc
- Lack of adequate usage examples
  - Inserting vertices and edges, querying for neighbors is not enough
  - What is the best way to traverse the structure?
  - How should data ingest, analysis, and extracting results be performed?
- Highlights
  - NetworkX is very well documented, both examples and API doc
  - Titan's API doc is very complete, but lacks examples of real-world usage
  - Bagel, STINGER lack formal documentation or have fragmented and occasionally inaccurate documentation

Acknowledgment of Support