! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)



Similar documents
Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

E6895 Advanced Big Data Analytics Lecture 4:! Data Store

! E6893 Big Data Analytics Lecture 10:! Linked Big Data Graph Computing (II)

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

Graph Database Proof of Concept Report

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

How graph databases started the multi-model revolution

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

Cray: Enabling Real-Time Discovery in Big Data

Big Graph Data Management

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Integrating Apache Spark with an Enterprise Data Warehouse

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

bigdata Managing Scale in Ontological Systems

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Moving From Hadoop to Spark

E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms -- I

SYSTAP / bigdata. Open Source High Performance Highly Available. 1 bigdata Presented to CSHALS 2/27/2014

Towards Smart and Intelligent SDN Controller

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Graph Database Performance: An Oracle Perspective

GRAPH DATABASE SYSTEMS. h_da Prof. Dr. Uta Störl Big Data Technologies: Graph Database Systems - SoSe

Apache Flink Next-gen data analysis. Kostas

G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE

TITAN BIG GRAPH DATA WITH CASSANDRA #TITANDB #CASSANDRA12

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Challenges for Data Driven Systems

YarcData urika Technical White Paper

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Lecture Data Warehouse Systems

Review of Graph Databases for Big Data Dynamic Entity Scoring

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Semantic Stored Procedures Programming Environment and performance analysis

An empirical comparison of graph databases

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

Seeking Fast, Durable Data Management: A Database System and Persistent Storage Benchmark

Unified Big Data Processing with Apache Spark. Matei

InfiniteGraph: The Distributed Graph Database

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc

PUBLIC Performance Optimization Guide

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Parallel Replication for MySQL in 5 Minutes or Less

Microblogging Queries on Graph Databases: An Introspection

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Performance and Scalability Overview

The Synergy Between the Object Database, Graph Database, Cloud Computing and NoSQL Paradigms

Oracle Big Data Spatial and Graph

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Performance and Scalability Overview

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Oracle Big Data SQL Technical Update

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

urika! Unlocking the Power of Big Data at PSC

ABSTRACT 1. INTRODUCTION. Kamil Bajda-Pawlikowski

Big Systems, Big Data

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Data Analytics - Accelerated. stream-horizon.com

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here

Big RDF Data Partitioning and Processing using hadoop in Cloud

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Big Graph Processing: Some Background

A Brief Introduction to Apache Tez

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

H2O on Hadoop. September 30,

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

1-Oct 2015, Bilbao, Spain. Towards Semantic Network Models via Graph Databases for SDN Applications

Large-Scale Data Processing

Data sharing in the Big Data era

SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013

Performance And Scalability In Oracle9i And SQL Server 2000

MyOra 3.5. User Guide. SQL Tool for Oracle. Kris Murthy

In Memory Accelerator for MongoDB

Semantic Modeling with RDF. DBTech ExtWorkshop on Database Modeling and Semantic Modeling Lili Aunimo

! E6893 Big Data Analytics Lecture 11:! Linked Big Data Graphical Models (II)

LDIF - Linked Data Integration Framework

OntoDBench: Ontology-based Database Benchmark

In-Memory Data Management for Enterprise Applications

Rackspace Cloud Databases and Container-based Virtualization

A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud.

Data processing goes big

imgraph: A distributed in-memory graph database

FastForward I/O and Storage: ACG 6.6 Demonstration

Accelerating Cassandra Workloads using SanDisk Solid State Drives

A Practical Approach to Process Streaming Data using Graph Database

Mining Large Datasets: Case of Mining Graph Data in the Cloud

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

Transcription:

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center October 30th, 2014

Course Structure Class Data Number Topics Covered 09/04/14 1 Introduction to Big Data Analytics 09/11/14 2 Big Data Analytics Platforms 09/18/14 3 Big Data Storage and Processing 09/25/14 4 Big Data Analytics Algorithms -- I 10/02/14 5 Big Data Analytics Algorithms -- II (recommendation) 10/09/14 6 Big Data Analytics Algorithms III (clustering) 10/16/14 7 Big Data Analytics Algorithms IV (classification) 10/23/14 8 Big Data Analytics Algorithms V (classification & clustering) 10/30/14 9 Linked Big Data Graph Computing I 11/06/14 10 Linked Big Data Graph Computing II 11/13/14 11 Hardware, Processors, and Cluster Platforms 11/20/14 12 Big Data Next Challenges IoT, Cognition, and Beyond 11/27/14 Thanksgiving Holiday 12/04/14 13 Final Projects Discussion (Optional) 12/11/14 & 12/12/14 14-15 Two-Day Big Data Analytics Workshop Final Project Presentations 2

Forrester: Over 25% of enterprise will use Graph DB by 2017 TechRadar: Enterprise DBMS, Q12014 Graph DB is in the significant success trajectory, and with the highest business value in the upcoming DBs. System G Team 2014 IBM Corporation

4 Source: Dr. Toyotaro Suzumura, ICPE2014 E6893 Big keynote Data Analytics Lecture 9: Linked Big Data: Graph Computing

Graph Computation Step 1: Computations on graph structures / topologies Example converting Bayesian network into junction tree, graph traversal (BFS/DFS), etc. Characteristics Poor locality, irregular memory access, limited numeric operations! Step 2: Computations on graphs with rich properties Example Belief propagation: diffuse information through a graph using statistical models Characteristics Locality and memory access pattern depend on vertex models Typically a lot of numeric operations Hybrid workload Step 3: Computations on dynamic graphs Characteristics Poor locality, irregular memory access Operations to update a model (e.g., cluster, sub-graph) 3-core subgraph Hybrid workload 5

Parallel Inference Parallelism at Multiple Granularity Levels Absorb Marginalize Extend Multiply/Divide Marginalization 6

Large-Scale Graph Benchmark Graph 500 7 IBM BlueGene or P775: 24 out of Top 30, except #4, #14, #22: Fujitsu (algorithm is from IBM) #6: Tianhe #15: Cray #24: HP

ScaleGraph algorithms (IBM System G s open source) made Top #1 in Graph 500 benchmark 8 Source: Dr. Toyotaro Suzumura, ICPE2014 keynote

RDF and SPARQL 9

RDF and SPARQL 10

Resource Description Format (RDF) A W3C standard sicne 1999 Triples Example: A company has nince of part p1234 in stock, then a simplified triple rpresenting this might be {p1234 instock 9}. Instance Identifier, Property Name, Property Value. In a proper RDF version of this triple, the representation will be more formal. They require uniform resource identifiers (URIs). 11

An example complete description 12

Advantages of RDF Virtually any RDF software can parse the lines shown above as self-contained, working data file. You can declare properties if you want. The RDF Schema standard lets you declare classes and relationships between properties and classes. The flexibility that the lack of dependence on schemas is the first key to RDF's value.! Split trips into several lines that won't affect their collective meaning, which makes sharding of data collections easy. Multiple datasets can be combined into a usable whole with simple concatenation.! For the inventory dataset's property name URIs, sharing of vocabulary makes easy to aggregate. 13

SPARQL Query Langauge for RDF The following SPQRL query asks for all property names and values associated with the fbd:s9483 resource: 14

The SPAQRL Query Result from the previous example 15

Another SPARQL Example What is this query for? Data 16

Open Source Software Apache Jena 17

Property Graphs 18

Reference 19

A usual example 20

Query Example I 21

Query Examples II & III 22 Computational intensive

Graph Database Example 23

Executation Time in the example of finding extended friends (by Neo4j) 24

Modeling Order History as a Graph 25

A query language on Property Graph Cypher 26

Cypher Example 27

Other Cypher Clauses 28

Property Graph Example Shakespheare 29

Creating the Shakespeare Graph 30

Query on the Shakespear Graph 31

Another Query on the Shakespear Graph 32

Chaining on the Query 33

Example Email Interaction Graph What's this query for? 34

Building Application Example Collaborative Filtering 35

How to make graph database fast? 36

Use Relationships, not indexes, for fast traversal 37

Storage Structure Example 38

Nodes and Relationships in the Object Cache 39

An Emerging Benchmark Test Set: data generator of full social media activity simulation of any number of users 40

TinkerPop Stack and choices of Open Source Graph DBs JVM API + Stack for Graph Includes: - Gremlin traversal language - - BluePrints JDBC of Graph DB s - BYO persistence graph store: - Growing community - Ex. Titan, Neo4J, InfiniteGraph! - But - It is still work in progress - - analytics (Furnace) are not yet implemented - - No support for pushing work to data nodes - - Big data support is not clear Titan by Aurelius Apache2 license Uses HBase, Berkeley DB, or Cassandra Includes Rexster, Frames, Gremlin, Blueprints Recently added: Faunus(M/R). RDF, bulk load Neo4J GraphDB GPLv3 or commercial license Fast Graph must fit on each cluster node 41

A recent experiment Dataset: 12.2 million edges, 2.2 million vertices Goal: Find paths in a property graph. One of the vertex property is call TYPE. In this scenario, the user provides either a particular vertex, or a set of particular vertices of the same TYPE (say, "DRUG"). In addition, the user also provides another TYPE (say, "TARGET"). Then, we need find all the paths from the starting vertex to a vertex of TYPE TARGET. Therefore, we need to 1) find the paths using graph traversal; 2) keep trace of the paths, so that we can list them after the traversal. Even for the shortest paths, it can be multiple between two nodes, such as: drug->assay->target, drug->moa->target First test (coldstart) Avg time (100 tests) Requested depth 5 traversal Requested full depth traversal IBM System G (NativeStore C++) 39 sec 3.0 sec 4.2 sec IBM System G (NativeStore JNI) 57 sec 4.0 sec 6.2 sec Java Neo4j (Blueprints 2.4) 105 sec 5.9 sec 8.3 sec Titan (Berkeley DB) 3861 sec 641 sec 794 sec Titan (HBase) 3046 sec 1597 sec 2682 sec 42 First full test - full depth 23. All data pulled from disk. Nothing initially cached. Modes - All tests in default modes of each graph implementation. Titan can only be run in transactional mode. Other implementations do not default to transactional mode.

IBM System G Native Store Overview System G Native store represents graphs in-memory and on-disk Organizing graph data for representing a graph that stores both graph structure and vertex properties and edge properties Caching graph data in memory in either batch-mode or on-demand from the on-disk streaming graph data Accepting graph updates and modifying graph structure and/or property data accordingly and incorperating time stamps Add edge, remove vertex, update property, etc. Persisting graph updates along with the time stamps from in-memory graph to on-disk graph Performing graph queries by loading graph structure and/or property data Find neighbors of a vertex, retrieve property of an edge, traverse a graph, etc.!!! Graph data Graph queries graph engine In-memory cached graph Time stamp control on-disk persistent graph 43

On-Disk Graph Organization Native store organizes graph data for representing a graph with both structure and the vertex properties and edge properties using multiple files in Linux file system Creating a list called ID Offset where each element translates a vertex (edge) ID into two offsets, pointing to the earliest and latest data of the vertex/edge, respectively Creating a list called Time_stamp Offset where each element has a time stamp, an offset to the previous time stamp of the vertex/edge, and a set of indices to the adjacent edge list and properties Create a list of chained block list to store adjacent list and properties! On-disk persistent graph: 44

Cache-Amenability of In-memory Data Structure Graph computations is notorious for highly irregularity in data access patterns, resulting in poor data locality for cache performance Data locality still exists in both graph structure and properties Should we leverage the limited data locality and organize the graph data layout accordingly? 45

Reducing Disk Accesses by Caching Cache graph data in memory either by: Caching in batch-mode leads to keep a chunk of the graph data files (lists) in memory for efficient data retrieval Caching on-demand loads elements from on-disk files into memory only when the vertices and/or edges are accessed Stitching elements from different data files together according to their VID or EID inmemory element, so as to increase data locality Behaving as a in-memory database when memory is sufficient to host all graph data In-memory graph data! Keep the offsets for write through 46 On-disk persistent graph files

Property Bundles Column Family The properties of a vertex (or an edge) can be stored in a set of property files, each called a property bundle, which consists of one or multiple properties that will be read/ write all together and therefore results in the following characteristics: Organizing properties that are usually used together into a single bundle for reducing disk IO Organizing properties not usually used together into different bundles for reducing disk band width Allowing dynamic all/delete properties to any property bundle 47

Impact from Storage Hardware Convert csv file (adams.csv 20G) to datastore Similar performance: 7432 sec versus 7806 sec CPU intensive Average CPU util.: 97.4 versus 97.2 I/O pattern Maximum read rate: 5.0 vs. 5.3 Maximum write rate: 97.7 vs. 85.3 Ratio HDD/SDD SSD offers consistently higher performance for both read and write TYPE1 TYPE2 TYPE3 TYPE4 13.79 6.36 19.93 2.44 Queries Type 1: find the most recent URL and PCID of a user Type 2: find all the URLs and PCIDs Type 3: find all the most recent properties Type 4: find all the historic properti 48

Impact from Storage Hardware 2 Dataset: IBM Knowledge Repository 138614 Nodes, 1617898 Edges OS buffer is flushed before test Processing 320 queries in parallel In memory graph cache size: 4GB (System G default value) 49

Homework 3 1. Classification: 1.1: As introduced by TA in Demo Session 3, using the 20 newsgroups data, try various classification algorithms provided by Mahout, and discuss their performance 1.2: Do similar experiments as in 1.1 on the Wikipedia data that you downloaded in Homework 2.! 2. Graph Database: 2.1: Use the two datasets that you used in Homework #2, Item 1 Recommendation. Inject them into a graph database, e.g., Neo4j or ScaleGraph DB. Use the graph query language to get person-person recommendation and item-item recommendation. Compare results with Mahout. 2.2: (Optional) Use the URL links in the Wikipedia data that you downloaded. Create a knowledge graph. Inject it into a graph database. Try some queries to find relevant terms. This can serve as keyword expansion. 50

Questions? 51