SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG

Similar documents
BIG DATA INTEGRATION RESEARCH AT THE UNIVERSITY OF LEIPZIG

Big Data and Analytics: Challenges and Opportunities

COMP9321 Web Application Engineering

FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics

How Companies are! Using Spark

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

SAP HANA From Relational OLAP Database to Big Data Infrastructure

Apache Flink Next-gen data analysis. Kostas

Load-Balancing the Distance Computations in Record Linkage

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

DduP Towards a Deduplication Framework utilising Apache Spark

On a Hadoop-based Analytics Service System

Big Data for Investment Research Management

International Journal of Engineering Research ISSN: & Management Technology November-2015 Volume 2, Issue-6

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Learning-based Entity Resolution with MapReduce

Mining Large Datasets: Case of Mining Graph Data in the Cloud

BIG DATA INTEGRATION AT SCADS DRESDEN/LEIPZIG

Managing large clusters resources

HiBench Introduction. Carson Wang Software & Services Group

Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

Extend your analytic capabilities with SAP Predictive Analysis

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis

What s next for the Berkeley Data Analytics Stack?

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Big Data and Data Science. The globally recognised training program

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Evaluating partitioning of big graphs

Scaling Out With Apache Spark. DTL Meeting Slides based on

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

A Brief Introduction to Apache Tez

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Native Connectivity to Big Data Sources in MSTR 10

BIG DATA AND ANALYTICS

Mastering Big Data. Steve Hoskin, VP and Chief Architect INFORMATICA MDM. October 2015

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

How To Balance In Cloud Computing

Big Graph Processing: Some Background

Big Data Mining Services and Knowledge Discovery Applications on Clouds

How To Use Spagobi Suite

PARC and SAP Co-innovation: High-performance Graph Analytics for Big Data Powered by SAP HANA

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Architectures for massive data management

Distributed R for Big Data

Chase Wu New Jersey Ins0tute of Technology

Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Big Data for Investment Research Management

UPS battery remote monitoring system in cloud computing

Introduction to Spark

3.1 Solving Systems Using Tables and Graphs

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Business Intelligence in Microservice Architecture. Debarshi bol.com

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

BIG DATA TECHNOLOGY. Hadoop Ecosystem

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

Analance Data Integration Technical Whitepaper

How To Use Hadoop For Gis

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1

Workshop on Hadoop with Big Data

Hadoop in the Enterprise

Big Data and Scripting Systems build on top of Hadoop

Unified Big Data Processing with Apache Spark. Matei

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

HadoopRDF : A Scalable RDF Data Analysis System

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Big Data and Scripting Systems beyond Hadoop

Big Data Research in Berlin BBDC and Apache Flink

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group


How To Scale Out Of A Nosql Database

Big Data Explained. An introduction to Big Data Science.

LDIF - Linked Data Integration Framework

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud.

Transcription:

SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG MARTIN JUNGHANNS, ANDRE PETERMANN, ERHARD RAHM www.scads.de

RESEARCH ON GRAPH ANALYTICS Graph Analytics on Hadoop (Gradoop) Distributed graph data management Rich graph data model with powerful operators Domain independent Business Intelligence with Integrated Instance Graphs (BIIIG) Graph-based data integration Graph OLAP, Mining and visualization Improved Scalability on Gradoop 2

GRAPHS ARE EVERYWHERE AND LARGE Social science Life science Engineering Information science 3

END-TO-END GRAPH ANALYTICS Data Integration Graph Analytics Representation Integrate data from one or more sources into a dedicated graph storage with common graph data model Definition of analytical workflows from operator algebra Result representation in meaningful way 4

An end-to-end framework and research platform for efficient, distributed and domain independent graph data management and analytics. 5

HIGH LEVEL ARCHITECTURE Data flow Control flow Workflow Declaration Visual GrALa DSL Representation Workflow Execution Operator Implementations Data Integration Graph Analytics Representation Extended Property Graph Model HBase Distributed Graph Store HDFS Cluster 6

EXTENDED PROPERTY GRAPH MODEL 7

GRALA OPERATORS 1. Collection Operators Select, Union, Intersect, Difference Sort by, Top, Distinct 2. Graph Operators Pattern Matching Combination, Overlap, Exclusion Aggregation, Summarization, Projection 3. Auxillary Operators Apply, Reduce, Call 8

WORKFLOW EXAMPLE: SUMMARIZATION 1: persongraph = db.g[0].combine(db.g[1]).combine(db.g[2]) 2: vertexgroupingkeys = {:type, city } 3: edgegroupingkeys = {:type} 4: vertexaggfunc = (Vertex vsum, Set vertices => vsum[ count ] = vertices ) 5: edgeaggfunc = (Edge esum, Set edges => esum[ count ] = edges ) 6: sumgraph = persongraph.summarize(vertexgroupingkeys, edgegroupingkeys, vertexaggfunc, edgeaggfunc) 9

WORKFLOW EXAMPLE: SUMMARIZATION 1: persongraph = db.g[0].combine(db.g[1]).combine(db.g[2]) 2: vertexgroupingkeys = {:type, city } 3: edgegroupingkeys = {:type} 4: vertexaggfunc = (Vertex vsum, Set vertices => vsum[ count ] = vertices ) 5: edgeaggfunc = (Edge esum, Set edges => esum[ count ] = edges ) 6: sumgraph = persongraph.summarize(vertexgroupingkeys, edgegroupingkeys, vertexaggfunc, edgeaggfunc) 10

SUMMARY & ROADMAP: GRADOOP Summary end-to-end framework for graph data management and analytics extended property graph model (EPGM) with powerful operators initial implementation running (HBase, MapReduce and Giraph) Roadmap WIP: workflow execution layer (Flink, Spark, ) WIP: reference implementation for all operators optimized graph partitioning approaches graph-based data integration (DeDoop) 11

BIIIG ON GRADOOP Fitting data model Complex Analytics composed of Gradoop Operators Example: Cluster Characteristic Patterns in Business Process Executions Quantify clusters by business measure (e.g., profitable and lossy) Characteristic = frequent within one but not in other clusters 12

BIIIG OVERVIEW 13

BUSINESS TRANSACTION GRAPH 14

CLUSTER-CHARACTERISTIC PATTERNS 15

CLUSTER-CHARACTERISTIC PATTERNS // generate base collection btgs = iig.callforcollection( :BusinessTransactionGraphs, {} ) 16

CLUSTER-CHARACTERISTIC PATTERNS 17

CLUSTER-CHARACTERISTIC PATTERNS // generate base collection btgs = iig.callforcollection( :BusinessTransactionGraphs, {} ) // aggregate profit aggfunc = ( Graph g => g.v.values( Revenue").sum() - g.v.values( Expense").sum() ) 18

CLUSTER-CHARACTERISTIC PATTERNS 19

CLUSTER-CHARACTERISTIC PATTERNS // generate base collection btgs = iig.callforcollection( :BusinessTransactionGraphs, {} ) // aggregate profit aggfunc = ( Graph g => g.v.values( Revenue").sum() - g.v.values( Expense").sum() ) btgs = btgs.apply( Graph g => g.aggregate( Profit, aggfunc ) ) 20

CLUSTER-CHARACTERISTIC PATTERNS 21

CLUSTER-CHARACTERISTIC PATTERNS 22

CLUSTER-CHARACTERISTIC PATTERNS // specific projection vertexfunc = (Vertex v => new Vertex( (v[ IsMasterData ]? v[ SourceID ] : v[:type]), { Result :v[ Result ]} ) edgefunc = (Edge e => new Edge( (e[:type]), {} ) btgs = btgs.apply( Graph g => g.project( vertexfunc, edgefunc ) ) 23

CLUSTER-CHARACTERISTIC PATTERNS 24

CLUSTER-CHARACTERISTIC PATTERNS // select profit and loss clusters proftitbtgs = btgs.select( Graph g => g[ Result ] >= 0 ) lossbtgs = btgs.difference(profitbtgs) 25

CLUSTER-CHARACTERISTIC PATTERNS 26

CLUSTER-CHARACTERISTIC PATTERNS // select profit and loss clusters proftitbtgs = btgs.select( Graph g => g[ Result ] >= 0 ) lossbtgs = btgs.difference(profitbtgs) profitfreqpats = proftitbtgs.callforcollection( :FrequentSubgraphs, { Threshold :0.7} ) lossfreqpats = lossbtgs.callforcollection( :FrequentSubgraphs, { Threshold :0.7} ) // determine cluster characteristic patterns trivialpats = profitfreqpats.intersect(lossfreqpats) profitcharpatterns = profitfreqpats.difference(trivialpats) losscharpatterns = lossfreqpats.difference(trivialpats) 27

SUMMARY & ROADMAP: BIIIG Summary Graph-based business intelligence framework Graph transformations of business information systems Concept of Business Transaction Graphs Roadmap WIP: distributed frequent pattern mining Summarization-based Graph OLAP Meaningful result representation Real-world evaluation 28

REFERENCES Junghanns, M., Petermann, A., Gomez, K., Peukert, E., Rahm, E.: GRADOOP - Scalable Graph Data Management and Analytics with Hadoop. Tech. report, Univ. of Leipzig, June 2015 Kolb L., E. Rahm: Parallel Entity Resolution with Dedoop. Datenbank-Spektrum 13(1): 23-32 (2013) Kolb L., A. Thor, E. Rahm: Dedoop: Efficient Deduplication with Hadoop. PVLDB 5(12), 2012 Kolb L., A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012: 618-629 Kolb L., Z. Sehili, E. Rahm: Iterative Computation of Connected Graph Components with MapReduce. Datenbank-Spektrum 14(2): 107-117 (2014) Petermann A., M. Junghanns, R. Müller, E. Rahm: BIIIG : Enabling Business Intelligence with Integrated Instance Graphs. Proc. 5th Int. Workshop on Graph Data Management (GDM 2014) Petermann A., M. Junghanns, R. Müller, E. Rahm: Graph-based Data Integration and Business Intelligence with BIIIG. Proc. VLDB Conf., 2014 Petermann, A.; Junghanns, M.; Müller, R.; Rahm, E.: FoodBroker - Generating Synthetic Datasets for Graph- Based Business Analytics. Proc. 5th Int. Workshop on Big Data Benchmarking (WBDB), 2014 Jindal, A. et.al.: Vertexica: your relational friend for graph analytics!. PVLDB 7(13), 2014 Rudolf, M. et.al.: The Graph Story of the SAP HANA Database. BTW, 2013 29

Thank you! www.gradoop.org www.biiig.org 30