GRAPHALYTICS: A Big Data Benchmark for Graph-Processing Platforms




GRAPHALYTICS: A Big Data Benchmark for Graph-Processing Platforms
Graphalytics: Benchmarking Graph-Processing Platforms
Mihai Capotă, Yong Guo, Ana Lucia Varbanescu, Tim Hegeman, Jorai Rijsdijk, Alexandru Iosup, Jose Larriba Pey, Arnau Prat, Peter Boncz, Hassan Chafi
LDBC TUC Meeting, Barcelona, Spain, March 2015
GRAPHALYTICS was made possible by a generous contribution from Oracle.

[Map slide] TU Delft, the Netherlands, Europe. Delft: founded 13th century, pop. 100,000. TU Delft: founded 1842, pop. 13,000. The Netherlands: pop. 16.5 M. Venue: Barcelona.

The Parallel and Distributed Systems Group at TU Delft (three VENI grants):
Alexandru Iosup: grids/clouds, P2P systems, big data/graphs, online gaming
Dick Epema: grids/clouds, P2P systems, video-on-demand, e-science
Ana Lucia Varbanescu (now UvA): HPC systems, multi-cores, big data/graphs
Henk Sips: HPC systems, multi-cores, P2P systems
Johan Pouwelse: P2P systems, file-sharing, video-on-demand
Home page: www.pds.ewi.tudelft.nl; see the PDS publication database at publications.st.ewi.tudelft.nl
Winners of the IEEE TCSC Scale Challenge 2014

Graphs at the Core of Our Society: The LinkedIn Example
A very good resource for matchmaking the workforce and prospective employers. Vital for your company's life, as your Head of HR would tell you; vital for the prospective employees.
[Chart: LinkedIn membership growth — May 2010; Mar 2011: 69 M; Feb 2012: 100 M; Apr 2014]
Sources: Vincenzo Cosenza, The State of LinkedIn, http://vincos.it/the-state-of-linkedin/ via Christopher Penn, http://www.shiftcomm.com/2014/02/state-linkedin-social-media-dark-horse/

The data deluge: large-scale graphs
One network: 300 M users, ??? edges. Another: 270 M MAU, 200+ avg. followers, >54 B edges. Another: 1.2 B MAU, 0.8 B DAU, 200+ avg. followers, >240 B edges.
Oracle on these networks: 1.2 M followers, 132 k employees; per company per day, 40-60 posts and 500-700 comments.
Data-intensive workload: 10x graph size → 100x-1,000x slower.
Compute-intensive workload: more complex analysis → ?x slower.
Dataset-dependent workload: unfriendly graphs → ??x slower.

Graphs at the Core of Our Society: The LinkedIn Example
3-4 new users every second. Great, if you can process this graph: opinion mining, hub detection, etc. Needed: periodic and/or continuous analytics at full scale.
100+ million questions of customer retention, of (lost) customer influence, of ... but fewer visitors (and page views).
[Chart: May 2010; Mar 2011: 69 M; Feb 2012: 100 M; Apr 2014: 300,000,000]
Sources: Vincenzo Cosenza, The State of LinkedIn, http://vincos.it/the-state-of-linkedin/ via Christopher Penn, http://www.shiftcomm.com/2014/02/state-linkedin-social-media-dark-horse/

The "sorry, but ..." moment: supporting multiple users. 10x number of users → ????x slower.

Graph Processing @large
Pipeline: ETL → Active Storage (filtering, compression, replication, caching) → distribution to a graph-processing platform → Algorithm. Ideally, each stage with N cores/disks runs Nx faster.
Interactive processing and streaming are not considered in this presentation.
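The "ideally, N cores/disks → Nx faster" claim above is exactly the notion of parallel speedup and efficiency, which quantifies how far a real run falls short of linear scaling. A minimal sketch (the helper names are illustrative, not part of Graphalytics):

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run over the single-worker baseline."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_workers):
    """Fraction of the ideal N-times speedup actually achieved."""
    return speedup(t_serial, t_parallel) / n_workers

# Hypothetical example: an algorithm takes 120 s on 1 core, 20 s on 8 cores.
s = speedup(120.0, 20.0)        # 6.0x, short of the ideal 8x
e = efficiency(120.0, 20.0, 8)  # 0.75, i.e. 75% of linear scaling
```

Efficiency well below 1.0 across stages is what turns the ideal pipeline above into the 100x-1,000x slowdowns the slides report.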

Graph-Processing Platforms
Platform: the combined hardware, software, and programming system that is being used to complete a graph-processing task.
Which to choose? What to tune? [Platform logos, e.g. Trinity]

What is the performance of graph-processing platforms?
Graph500: single application (BFS), single class of synthetic datasets.
Few existing platform-centric comparative studies: they prove the superiority of a given system, with a limited set of metrics.
GreenGraph500, GraphBench, XGDBench: limited in representativeness, systems covered, and metrics.
Needed: metric diversity, graph diversity, algorithm diversity.
Graphalytics = a comprehensive benchmarking suite for graph processing across all platforms.

Graphalytics = A Challenging Benchmarking Process
Methodological challenges:
Challenge 1. Evaluation process
Challenge 2. Selection and design of performance metrics
Challenge 3. Dataset selection and analysis of coverage
Challenge 4. Algorithm selection and analysis of coverage
Practical challenges:
Challenge 5. Scalability of evaluation and selection processes
Challenge 6. Portability
Challenge 7. Result reporting
Y. Guo, A. L. Varbanescu, A. Iosup, C. Martella, T. L. Willke: Benchmarking graph-processing platforms: a vision. ICPE 2014: 289-292

Graphalytics = Many Classes of Algorithms
Literature survey of metrics, datasets, and algorithms: 10 top research conferences (SIGMOD, VLDB, HPDC, ...); keywords: graph processing, social network; 2009-2013, 124 articles.

Class                 Examples                      %
Graph Statistics      Diameter, PageRank            16.1
Graph Traversal       BFS, SSSP, DFS                46.3
Connected Component   Reachability, BiCC            13.4
Community Detection   Clustering, Nearest Neighbor  5.4
Graph Evolution       Forest Fire Model, PAM        4.0
Other                 Sampling, Partitioning        14.8

Future work
Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis, IPDPS'14.
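Graph Traversal dominates the surveyed workloads at 46.3%. As a concrete reference point for that class, BFS from a single source can be sketched in a few lines (a plain sequential version for illustration, not one of the distributed implementations the benchmark targets):

```python
from collections import deque

def bfs_levels(adj, source):
    """Return the BFS level (hop distance) of every vertex reachable from source.

    adj: dict mapping vertex -> iterable of neighbor vertices.
    """
    level = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:            # first visit fixes the distance
                level[w] = level[v] + 1
                queue.append(w)
    return level

# Tiny example graph with edges 0-1, 0-2, 1-3.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs_levels(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```

The same traversal pattern underlies Graph500's kernel and the SSSP/DFS variants listed in the table.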

Graphalytics = Real & Synthetic Datasets
Sources: https://snap.stanford.edu/, http://www.graph500.org/, the Game Trace Archive (http://gta.st.ewi.tudelft.nl/), and the LDBC Social Network Generator. Interaction graphs are possible future work.
Y. Guo and A. Iosup. The Game Trace Archive, NETGAMES 2012.

Graphalytics = Advanced Harness
Cloud support is technically feasible but methodologically difficult.

Graphalytics = Enhanced LDBC Datagen
A battery of graphs covering a rich set of configurations. Datagen extensions to support more diverse degree distributions, and to control clustering coefficient and assortativity. Ongoing work.
LDBC D3.3.34, http://ldbcouncil.org/sites/default/files/ldbc_d3.3.34.pdf, and Orri Erling et al. The LDBC Social Network Benchmark: Interactive Workload, SIGMOD'15.
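One of the distribution properties mentioned for Datagen, the local clustering coefficient, is straightforward to compute directly on small graphs. A naive sketch for intuition (production generators and benchmarks use far more scalable methods):

```python
def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected.

    adj: dict mapping vertex -> set of neighbors (undirected graph).
    """
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # coefficient is undefined; 0 by convention here
    links = sum(1 for a in nbrs for b in nbrs
                if a < b and b in adj[a])   # count each neighbor pair once
    return 2.0 * links / (k * (k - 1))

# Triangle 0-1-2 plus a pendant vertex: edges 0-1, 1-2, 2-0, 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj, 2))  # 1/3: only (0,1) of the 3 neighbor pairs is an edge
```

Controlling this coefficient in generated graphs matters because real social networks are far more clustered than uniform random graphs of the same degree distribution.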

Graphalytics = Advanced Monitoring & Logging
Diverse metrics: CPU, IOPS, network, memory use, system time. Automatic analysis matching the programming model. Ongoing work.
A. Iosup et al., Towards Benchmarking IaaS and PaaS Clouds for Graph Analytics. WBDB 2014
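In the same spirit as the harness's monitoring, the wall-clock time and peak Python-heap memory of a single algorithm run can be captured with only the standard library. A minimal sketch (Graphalytics itself collects far richer system-level metrics than this):

```python
import time
import tracemalloc

def run_monitored(algorithm, *args):
    """Run algorithm(*args) and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = algorithm(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # peak allocation during the run
    tracemalloc.stop()
    return result, elapsed, peak

# Hypothetical workload: build a list of one million squares.
result, seconds, peak = run_monitored(lambda n: [i * i for i in range(n)], 1_000_000)
print(f"{seconds:.3f} s, peak {peak / 1e6:.1f} MB")
```

Logging such per-run tuples across (dataset, algorithm, platform) combinations is the raw material for the automatic analysis the slide mentions.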

Graphalytics = Choke-Point Analysis
Choke points are crucial technological challenges that platforms are struggling with. Examples: network traffic, access locality, skewed execution.
Challenge: select benchmark workloads based on real-world scenarios, but make sure they cover the important choke points. Near-future work.
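The "skewed execution" choke point stems from skewed degree distributions: a few hub vertices concentrate the work of whole partitions. A quick, illustrative way to expose this on a list of vertex degrees (a sketch, not the benchmark's own analysis):

```python
def degree_skew(degrees):
    """Ratio of the maximum degree to the mean degree.

    Values near 1 indicate a balanced graph; large values indicate hub
    vertices that concentrate work on a few workers.
    """
    mean = sum(degrees) / len(degrees)
    return max(degrees) / mean

# A balanced graph vs. one hub among otherwise low-degree vertices.
print(degree_skew([4, 4, 4, 4]))           # 1.0: perfectly balanced
print(degree_skew([1000, 2, 2, 2, 2, 2]))  # ~5.9: the hub dominates
```

A workload that never touches high-skew graphs would leave this choke point untested, which is why dataset selection (Challenge 3) and choke-point coverage interact.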

Graphalytics = Advanced Software Engineering Process
https://github.com/mihaic/graphalytics/
All significant modifications to Graphalytics are peer-reviewed by developers. Internal release to LDBC partners (Feb 2015); public release, announced first through LDBC (Apr 2015*). Jenkins continuous integration server; SonarQube software quality analyzer.

Graphalytics in Practice
6 real-world datasets + 2 synthetic generators (data ingestion not included here). Many more metrics supported. 10 platforms tested with prototype implementations; 5 classes of algorithms. Missing results = failures of the respective systems.

Key Findings So Far: Towards Benchmarking Graph-Processing Platforms
Performance is a function of (Dataset, Algorithm, Platform, Deployment); previous performance studies lead to tunnel vision.
Platforms have their specific drawbacks (crashes, long execution times, tuning, etc.). The best-performing system depends on stakeholder needs.
Some platforms can scale up reasonably with cluster size (horizontally) or number of cores (vertically). Strong vs. weak scaling is still a challenge; workload scaling is tricky. Single-algorithm runs are not workflows/multi-tenancy.
Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis, IPDPS'14.

Thank you for your attention! Comments? Questions? Suggestions?
http://graphalytics.ewi.tudelft.nl
https://github.com/mihaic/graphalytics/
PELGA 2015, May 15: http://sites.google.com/site/pelga2015/
Alexandru Iosup, A.Iosup@tudelft.nl
GRAPHALYTICS was made possible by a generous contribution from Oracle.

A few extra slides

Discussion
How much preprocessing should we allow in the ETL phase? How do we choose a metric that captures the preprocessing?
http://graphalytics.ewi.tudelft.nl

Discussion
How should we assess the correctness of algorithms that produce approximate results? Are sampling algorithms acceptable as a trade-off between time to benchmark and benchmarking results?
http://graphalytics.ewi.tudelft.nl

Discussion
How should we set up the platforms? Should we allow algorithm-specific platform setups, or should we require a single setup for all algorithms?
http://graphalytics.ewi.tudelft.nl

Discussion
Towards full use cases, full workflows, and inter-operation of big data processing systems: how do we benchmark the entire chain needed to produce useful results, perhaps even with the human in the loop?
http://graphalytics.ewi.tudelft.nl
A. Iosup, T. Tannenbaum, M. Farrellee, D. H. J. Epema, M. Livny: Interoperating grids through Delegated MatchMaking. Scientific Programming 16(2-3): 233-253 (2008)