Graph Processing with Apache TinkerPop

Jason Plurad
Software Engineer, IBM
Committer, Apache TinkerPop

- Project Update
- Graph Landscape
- A Graph Problem
- Hands-On Graph
http://tinkerpop.apache.org

About Me
- Twitter: @pluradj
- GitHub: @pluradj
- Open channels
  - TinkerPop mailing lists (Users, Dev)
  - Titan mailing list
  - Stack Overflow

Apache TinkerPop
- 2009: Inception
- 2012: TinkerPop 2
- 2015: Apache Incubator
- 2016: Top-level project (TLP) vote passed! Waiting on board meeting to establish TLP

Podling Releases
- 3.0: Major refactor, Java 8 lambda expressions, Gremlin Server, OLAP graph computers
- 3.1: Hadoop 2 support, persisted RDDs
- 3.2: OLAP job chaining, OLAP graph filters, performance improvements

Common graph data domains
- Social Network Analysis
- Configuration Management Database
- Master Data Management
- Recommendation Engines
- Knowledge Graphs
- Internet of Things

Property Graph and Gremlin
- Structure: Vertex, Edge, Properties
- Traversal: Steps
- Gremlin
  - Functional
  - Data flow: forward and backward
  - Domain specific language (DSL) for graph
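
The property graph structure and the data-flow style of a traversal can be sketched in plain JavaScript. This is not TinkerPop or Gremlin itself: the `Traversal` class and its `V`, `has`, `out`, and `values` methods are hypothetical stand-ins that only mimic the shape of Gremlin steps, and the sample vertices and edges are invented.

```javascript
// Minimal in-memory property graph: vertices and edges both carry properties.
// All data here is invented for illustration.
const vertices = [
  { id: 1, label: 'package', properties: { name: 'app' } },
  { id: 2, label: 'package', properties: { name: 'left-pad' } },
  { id: 3, label: 'license', properties: { name: 'WTFPL' } },
];
const edges = [
  { label: 'dependency', from: 1, to: 2, properties: { range: '^1.0.0' } },
  { label: 'license', from: 2, to: 3, properties: {} },
];

// Hypothetical stand-in for a traversal: each step takes the current
// set of vertices and returns a new traversal over the resulting set.
class Traversal {
  constructor(current) { this.current = current; }
  static V() { return new Traversal(vertices); }
  // Filter step: keep vertices whose property matches.
  has(key, value) {
    return new Traversal(this.current.filter(v => v.properties[key] === value));
  }
  // Move step: follow outgoing edges with the given label.
  out(edgeLabel) {
    const ids = new Set(this.current.map(v => v.id));
    const targets = edges
      .filter(e => e.label === edgeLabel && ids.has(e.from))
      .map(e => vertices.find(v => v.id === e.to));
    return new Traversal(targets);
  }
  // Terminal step: read a property from each remaining vertex.
  values(key) { return this.current.map(v => v.properties[key]); }
}

const licenses = Traversal.V()
  .has('name', 'app')
  .out('dependency')
  .out('license')
  .values('name');
console.log(licenses);
```

The chained calls are meant to resemble a Gremlin traversal such as `g.V().has('name', 'app').out('dependency').out('license').values('name')`, where each step feeds its results to the next.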

Apache TinkerPop Graph Computing Framework

Graph Landscape
- Graph database vs Graph processor
- OLTP vs OLAP
- Neighborhood vs whole graph
- Multi-model: not the only store in your app

IBM Graph (Beta)
- Managed Graph-as-a-Service (OLTP)
- Focus on your data, not install and operations
- #sleepmore
- http://ibm.biz/ibmgraph

What is this?

    module.exports = xxxxxxx;

    function xxxxxxx (str, len, ch) {
      str = String(str);
      var i = -1;
      if (!ch && ch !== 0) ch = ' ';
      len = len - str.length;
      while (++i < len) {
        str = ch + str;
      }
      return str;
    }

A Graph Problem: Dependency Management
- On March 22, 2016 npm broke the Internet
- Left-pad was unpublished
  - 11 lines of code
  - WTFPL license
- Hundreds of breaking builds per minute
- http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
- Are we safe with Apache?
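
To make the size of the unpublished module concrete, here is a small usage sketch. It assumes the obscured function on the "What is this?" slide is the left-pad module named here; the name `leftPad` and the sample inputs are illustrative choices, not taken from the slides.

```javascript
// The padding function from the earlier slide, given a readable name
// (the name leftPad is an assumption for this example).
function leftPad(str, len, ch) {
  str = String(str);
  // Default to a space unless a pad character (including 0) was given.
  if (!ch && ch !== 0) ch = ' ';
  let i = -1;
  len = len - str.length;
  // Prepend the pad character until the target length is reached.
  while (++i < len) {
    str = ch + str;
  }
  return str;
}

const padded = leftPad(42, 5, 0);
const spaced = leftPad('abc', 6);
// String.prototype.padStart (standard since ES2017) covers the same need.
const builtin = String(42).padStart(5, '0');

console.log(JSON.stringify([padded, spaced, builtin]));
```

A function this small is easy to inline or replace, which is one reason a "too little code" dependency shows up among the risk factors discussed in the talk.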

Questions for the graph
- Which dependencies are at risk?
- Which ones should be refactored to avoid?

Risk factors
- Unsuitable license
- Single developer
- Too little code / Too much code
- Changes too frequently / Code is stagnant
- Nobody else is using it
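
One way to turn the risk factors into something checkable is a set of predicates over package metadata. The sketch below is only an illustration: the field names, the sample packages, the approved-license list, and every threshold are assumptions, not values from the talk.

```javascript
// Hypothetical package metadata; field names and values are invented.
const packages = [
  { name: 'tiny-util', license: 'WTFPL', maintainers: 1, linesOfCode: 11, releasesLastYear: 0, dependents: 0 },
  { name: 'big-framework', license: 'Apache-2.0', maintainers: 12, linesOfCode: 40000, releasesLastYear: 6, dependents: 900 },
];

// Assumed policy: which licenses a project accepts is a local decision.
const APPROVED_LICENSES = new Set(['Apache-2.0', 'MIT', 'BSD-3-Clause']);

// One predicate per risk factor; all thresholds are arbitrary placeholders.
const riskChecks = {
  unsuitableLicense: p => !APPROVED_LICENSES.has(p.license),
  singleDeveloper: p => p.maintainers <= 1,
  tooLittleCode: p => p.linesOfCode < 50,
  tooMuchCode: p => p.linesOfCode > 100000,
  changesTooFrequently: p => p.releasesLastYear > 50,
  stagnant: p => p.releasesLastYear === 0,
  nobodyElseUsesIt: p => p.dependents === 0,
};

// Return the names of all risk factors that apply to a package.
function riskFactors(pkg) {
  return Object.keys(riskChecks).filter(name => riskChecks[name](pkg));
}

const report = packages.map(p => ({ name: p.name, risks: riskFactors(p) }));
console.log(JSON.stringify(report, null, 2));
```

In a graph setting, several of these inputs (number of maintainers, number of dependents, license) would come from counting or following edges rather than from precomputed fields.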

Let's go for a ride!

Titan (Aurelius)
- Pick a graph database for OLTP
- Storage in Apache Cassandra or Apache HBase
- Apache license but not in ASF
- Code has stagnated in the open
- TinkerPop version bumps
- DataStax Enterprise (DSE) Graph
- Wide open opportunities
  - Apache S2Graph (incubating)
  - Apache Flink (Gelly)
  - Apache Solr (GraphQuery)
  - Other possibilities!

Apache Spark or Apache Giraph
- Pick a graph processor for OLAP
- Spark is the new hotness
- Giraph is better suited for gigantic graphs
- By using Apache TinkerPop and Gremlin, we can use either one seamlessly

Vagrant and VirtualBox
- Developers don't always get keys to the cloud
- Virtual machines to the rescue
  - Host: 16 GB RAM or more
  - 3-4 VMs with 3 GB RAM
- Prove out your graph algorithms on a small data set before wasting time on a big data set

Apache Ambari
- Simple install for Apache Hadoop and related Apache big data packages
  - HDFS, HBase, ZooKeeper, etc.
- Management and monitoring dashboard
- Enables integration of other software

Hands-On: Gremlin Console

Getting the data
- NPM registry runs on Apache CouchDB
- Replication in Apache CouchDB is awesome
- https://skimdb.npmjs.com/registry

Transform the data
- CouchDB is a document store
- Dependencies are graph data
- Other things can be too
  - Users
  - Keywords
  - License
- Graph model depends on the questions you want to ask of the graph
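
The document-to-graph step can be sketched as a function that takes one registry document and emits vertices and edges. The sketch rests on several assumptions: the document below is an invented, heavily simplified stand-in for a registry entry, `toGraphElements` is a hypothetical helper, and the `keyword` and `maintainer` edge labels are illustrative names.

```javascript
// Simplified, invented registry document; real documents carry many more fields.
const doc = {
  name: 'app',
  license: 'MIT',
  keywords: ['cli', 'demo'],
  maintainers: [{ name: 'alice' }],
  dependencies: { 'left-pad': '^1.0.0' },
  devDependencies: { mocha: '^2.4.5' },
};

// Turn one document into vertices (deduplicated by label and name) and labeled edges.
function toGraphElements(d) {
  const vertices = new Map();
  const edges = [];
  // Create the vertex on first use and return its key.
  const vertex = (label, name) => {
    const key = `${label}:${name}`;
    if (!vertices.has(key)) vertices.set(key, { label, name });
    return key;
  };
  const link = (label, from, to) => edges.push({ label, from, to });

  const pkg = vertex('package', d.name);
  if (d.license) link('license', pkg, vertex('license', d.license));
  for (const k of d.keywords || []) link('keyword', pkg, vertex('keyword', k));
  for (const m of d.maintainers || []) link('maintainer', pkg, vertex('person', m.name));
  for (const dep of Object.keys(d.dependencies || {})) {
    link('dependency', pkg, vertex('package', dep));
  }
  for (const dep of Object.keys(d.devDependencies || {})) {
    link('devdependency', pkg, vertex('package', dep));
  }
  return { vertices: [...vertices.values()], edges };
}

const elements = toGraphElements(doc);
console.log(elements.vertices.length, 'vertices,', elements.edges.length, 'edges');
```

Which fields become vertices and which stay as properties is a modeling choice; here licenses and keywords are vertices so that packages sharing them become connected.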

NPM Graph Schema
- Vertex labels (with counts): Person 125K, License 2K, Document 250K, Keyword 81K, Package 1.5M
- Edge labels: license, dependency, devdependency
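
With dependency edges in place, the question of which packages are exposed when one package disappears becomes a walk over incoming `dependency` edges. The following is a minimal sketch with invented package names and edges; `affectedBy` is a hypothetical helper, not part of any library.

```javascript
// Invented sample edges: [dependent, dependency].
const dependencyEdges = [
  ['app', 'web-framework'],
  ['web-framework', 'line-numbers'],
  ['line-numbers', 'left-pad'],
  ['cli-tool', 'left-pad'],
  ['other-app', 'request'],
];

// Index incoming edges so we can walk from a package to its dependents.
const dependentsOf = new Map();
for (const [from, to] of dependencyEdges) {
  if (!dependentsOf.has(to)) dependentsOf.set(to, []);
  dependentsOf.get(to).push(from);
}

// Breadth-first walk over incoming edges: every package reached
// depends on `pkg` directly or transitively.
function affectedBy(pkg) {
  const seen = new Set();
  const queue = [pkg];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const dependent of dependentsOf.get(current) || []) {
      if (!seen.has(dependent)) {
        seen.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return [...seen].sort();
}

const affected = affectedBy('left-pad');
console.log(affected);
```

This is a neighborhood-style query that starts from a single vertex, which is the kind of workload the talk associates with OLTP graph databases.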

The GraphComputer

Anatomy of a Vertex Program
- Vertex-centric graph logic
- Parallel execution (bulk synchronous parallel, BSP)
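
The vertex-centric, bulk synchronous parallel idea can be illustrated without any framework. In the sketch below, `runBsp` and `hopDistanceProgram` are hypothetical names, the graph is invented, and the loop runs vertices sequentially where a real graph computer would run them in parallel. It is not the TinkerPop `VertexProgram` API.

```javascript
// Invented sample graph as an adjacency list of outgoing edges.
const outEdges = { a: ['b', 'c'], b: ['d'], c: ['d'], d: [], e: ['a'] };

// Hypothetical vertex program: compute hop distance from a source vertex.
// execute() sees only one vertex's state and inbox and returns outgoing messages.
const hopDistanceProgram = {
  setup: (id, source) => ({ distance: id === source ? 0 : Infinity }),
  execute(id, state, inbox, superstep) {
    const best = Math.min(state.distance, ...inbox);
    // The source announces itself in superstep 0; afterwards only improvements count.
    const improved = best < state.distance || (superstep === 0 && best === 0);
    state.distance = best;
    // Only vertices whose value changed send messages to their neighbors.
    return improved ? outEdges[id].map(to => ({ to, value: best + 1 })) : [];
  },
};

function runBsp(program, source) {
  const ids = Object.keys(outEdges);
  const state = Object.fromEntries(ids.map(id => [id, program.setup(id, source)]));
  let inboxes = Object.fromEntries(ids.map(id => [id, []]));
  for (let superstep = 0; ; superstep++) {
    const next = Object.fromEntries(ids.map(id => [id, []]));
    let sent = 0;
    // Within a superstep every vertex could run in parallel; here we simply loop.
    for (const id of ids) {
      for (const msg of program.execute(id, state[id], inboxes[id], superstep)) {
        next[msg.to].push(msg.value);
        sent++;
      }
    }
    // Barrier: messages become visible only in the next superstep.
    // The computation halts once no vertex sends anything.
    if (sent === 0) return state;
    inboxes = next;
  }
}

const distances = runBsp(hopDistanceProgram, 'a');
console.log(distances);
```

The key property is that `execute` never reads another vertex's state directly; all communication goes through messages delivered at the superstep barrier.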

Out of the box Vertex Programs
- Traversal
- BulkLoader
- BulkDumper
- PageRank
- PeerPressure
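
As an illustration of what a whole-graph program such as PageRank computes, here is a textbook power-iteration sketch in plain JavaScript. It is not the TinkerPop implementation; the `pageRank` function, the damping factor of 0.85, the iteration count, and the sample graph are all assumptions chosen for the example.

```javascript
// Invented sample graph: package -> packages it depends on.
const links = {
  app: ['left-pad', 'request'],
  'cli-tool': ['left-pad'],
  request: ['left-pad'],
  'left-pad': [],
};

function pageRank(graph, iterations = 20, damping = 0.85) {
  const ids = Object.keys(graph);
  const n = ids.length;
  // Start with rank spread evenly over all vertices.
  let rank = Object.fromEntries(ids.map(id => [id, 1 / n]));
  for (let i = 0; i < iterations; i++) {
    const incoming = Object.fromEntries(ids.map(id => [id, 0]));
    let dangling = 0;
    for (const id of ids) {
      const targets = graph[id];
      if (targets.length === 0) {
        // Vertices without outgoing edges share their rank with everyone.
        dangling += rank[id];
      } else {
        // Each vertex splits its rank evenly across its outgoing edges.
        for (const to of targets) incoming[to] += rank[id] / targets.length;
      }
    }
    rank = Object.fromEntries(ids.map(id => [
      id,
      (1 - damping) / n + damping * (incoming[id] + dangling / n),
    ]));
  }
  return rank;
}

const ranks = pageRank(links);
const mostDependedOn = Object.keys(ranks).reduce((a, b) => (ranks[a] >= ranks[b] ? a : b));
console.log(ranks, mostDependedOn);
```

On a dependency graph, a high rank points at packages that many others rely on directly or indirectly, which is where an unpublished module would hurt most.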

Hands-On: Graph Program

Next stop? More data!
- Graphs are for connecting data!
- Consume data from GitHub
  - User data
  - Static code analysis
  - Code usage analysis
- Consume data from Twitter
  - Trending news
  - Security alerts

Summary
- Apache TinkerPop is for graph computing
- OLTP vs OLAP is an important distinction
- Gremlin allows you to seamlessly bridge the two
- Graph thinking is different than relational
- Is the future multi-model?
- Many opportunities to innovate in this space

Acknowledgements
- Marko Rodriguez: Gremlin language, Gremlin OLAP
- Ketrina Yim: Illustrator, creator of Gremlin and friends
- Stephen Mallette: TinkerPop release manager, Gremlin applications
- Daniel Kuppitz: Gremlin language guru
- David Robinson: Big data, multi-model architect/developer

Questions?

Thank you!