Graph Processing with Apache TinkerPop
  Jason Plurad
  Software Engineer, IBM
  Committer, Apache TinkerPop
Project Update Graph Landscape A Graph Problem Hands-On Graph http://tinkerpop.apache.org
About Me
  Twitter: @pluradj
  GitHub: @pluradj
  Open channels
    TinkerPop mailing lists (Users, Dev)
    Titan mailing list
    Stack Overflow
Apache TinkerPop
  2009: Inception
  2012: TinkerPop 2
  2015: Apache Incubator
  2016: TLP vote passed! Waiting on board meeting to establish TLP
Podling Releases
  3.0: Major refactor, Java 8 lambda expressions, Gremlin Server, OLAP graph computers
  3.1: Hadoop 2 support, persisted RDDs
  3.2: OLAP job chaining, OLAP graph filters, performance improvements
Common graph data domains
  Social Network Analysis
  Configuration Management Database
  Master Data Management
  Recommendation Engines
  Knowledge Graphs
  Internet of Things
Property Graph and Gremlin
  Structure
    Vertex
    Edge
    Properties
  Traversal
    Steps
  Gremlin
    Functional
    Data flow: forward and backward
    Domain-specific language (DSL) for graph
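To make the structure and traversal ideas concrete, here is a minimal sketch of a property graph in plain JavaScript. It is not the TinkerPop API: the names `addVertex`, `addEdge`, `stepOut`, and `stepIn` are hypothetical helpers, and the package names are made-up sample data. It only illustrates vertices and edges carrying properties, and steps that follow edges forward or backward.

```javascript
// A tiny in-memory property graph (illustrative only, not TinkerPop).
const vertices = new Map(); // id -> { id, label, properties }
const edges = [];           // { label, outV, inV, properties }

function addVertex(id, label, properties = {}) {
  vertices.set(id, { id, label, properties });
}

function addEdge(label, outV, inV, properties = {}) {
  edges.push({ label, outV, inV, properties });
}

// Forward step: from the given vertex ids, follow outgoing edges with this label.
function stepOut(ids, label) {
  return edges
    .filter(e => e.label === label && ids.includes(e.outV))
    .map(e => e.inV);
}

// Backward step: from the given vertex ids, follow incoming edges with this label.
function stepIn(ids, label) {
  return edges
    .filter(e => e.label === label && ids.includes(e.inV))
    .map(e => e.outV);
}

// Hypothetical sample data: two packages that depend on a third.
addVertex('app', 'Package', { name: 'app' });
addVertex('lib', 'Package', { name: 'lib' });
addVertex('left-pad', 'Package', { name: 'left-pad' });
addEdge('dependency', 'app', 'left-pad', { range: '^1.0.0' });
addEdge('dependency', 'lib', 'left-pad', { range: '^1.0.0' });

console.log(stepOut(['app'], 'dependency'));     // [ 'left-pad' ]
console.log(stepIn(['left-pad'], 'dependency')); // [ 'app', 'lib' ]
```

Chaining such steps is the data-flow idea behind a traversal: each step takes the current set of vertices and produces the next one.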
Apache TinkerPop Graph Computing Framework
Graph Landscape
  Graph database vs Graph processor
  OLTP vs OLAP
  Neighborhood vs whole graph
  Multi-model: not the only store in your app
IBM Graph (Beta)
  Managed Graph-as-a-Service (OLTP)
  Focus on your data, not install and operations
  #sleepmore
  http://ibm.biz/ibmgraph
What is this?
  module.exports = xxxxxxx;
  function xxxxxxx (str, len, ch) {
    str = String(str);
    var i = -1;
    if (!ch && ch !== 0) ch = ' ';
    len = len - str.length;
    while (++i < len) {
      str = ch + str;
    }
    return str;
  }
A Graph Problem: Dependency Management
  On March 22, 2016, npm broke the Internet
    Left-pad was unpublished
    11 lines of code
    WTFPL license
    Hundreds of breaking builds per minute
  http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
  Are we safe with Apache?
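Why does removing one tiny package break so many builds? A rough sketch, assuming a made-up dependency map (`dependsOn`) and a hypothetical helper `affectedBy`: walk the dependency edges backward from the removed package and collect everything that reaches it, directly or transitively.

```javascript
// Hypothetical data: each package maps to the packages it depends on.
const dependsOn = {
  'app': ['build-tool', 'web-lib'],
  'build-tool': ['string-utils'],
  'web-lib': [],
  'string-utils': ['left-pad'],
  'left-pad': [],
};

// Collect every package that depends on `removed`, directly or transitively.
function affectedBy(removed, graph) {
  // Build a reverse index: package -> its direct dependents.
  const dependents = new Map();
  for (const [pkg, deps] of Object.entries(graph)) {
    for (const dep of deps) {
      if (!dependents.has(dep)) dependents.set(dep, []);
      dependents.get(dep).push(pkg);
    }
  }

  // Breadth-first walk over the reversed edges.
  const seen = new Set();
  const queue = [removed];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const parent of dependents.get(current) || []) {
      if (!seen.has(parent)) {
        seen.add(parent);
        queue.push(parent);
      }
    }
  }
  return [...seen];
}

console.log(affectedBy('left-pad', dependsOn));
// [ 'string-utils', 'build-tool', 'app' ]
```

Only one package depends on `left-pad` directly in this toy map, yet three end up broken. That transitive reach is what makes this a graph problem.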
Questions for the graph
  Which dependencies are at risk?
  Which ones should we refactor to avoid?
  Risk factors
    Unsuitable license
    Single developer
    Too little code / Too much code
    Changes too frequently / Code is stagnant
    Nobody else is using it
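The risk factors above could be turned into simple checks over package properties. The following is a sketch only: the field names (`license`, `maintainers`, `linesOfCode`, `releasesLastYear`, `dependents`), the allowed-license list, and every threshold are assumptions chosen for illustration, not values from the talk.

```javascript
// Assumed policy: licenses considered acceptable (illustrative list).
const ALLOWED_LICENSES = new Set(['Apache-2.0', 'MIT', 'BSD-3-Clause']);

// Return the list of risk factors that apply to a package record.
// All thresholds below are arbitrary placeholders.
function riskFactors(pkg) {
  const flags = [];
  if (!ALLOWED_LICENSES.has(pkg.license)) flags.push('unsuitable license');
  if (pkg.maintainers <= 1) flags.push('single developer');
  if (pkg.linesOfCode < 50) flags.push('too little code');
  if (pkg.linesOfCode > 100000) flags.push('too much code');
  if (pkg.releasesLastYear > 50) flags.push('changes too frequently');
  if (pkg.releasesLastYear === 0) flags.push('code is stagnant');
  if (pkg.dependents === 0) flags.push('nobody else is using it');
  return flags;
}

// Hypothetical package record: tiny, single maintainer, WTFPL license.
const tinyPkg = {
  license: 'WTFPL',
  maintainers: 1,
  linesOfCode: 11,
  releasesLastYear: 2,
  dependents: 500,
};

console.log(riskFactors(tinyPkg));
// [ 'unsuitable license', 'single developer', 'too little code' ]
```

In a graph, several of these inputs are counts of neighbors (maintainers, dependents), which is where traversals come in.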
Let's go for a ride!
Pick a graph database for OLTP
  Titan (Aurelius)
    Storage in Apache Cassandra or Apache HBase
    Apache license but not in ASF
    Code has stagnated in the open
    TinkerPop version bumps
    DataStax Enterprise (DSE) Graph
  Wide open opportunities
    Apache S2Graph (incubating)
    Apache Flink (Gelly)
    Apache Solr (GraphQuery)
    Other possibilities!
Pick a graph processor for OLAP
  Apache Spark or Apache Giraph
    Spark is the new hotness
    Giraph is better suited for gigantic graphs
  By using Apache TinkerPop and Gremlin, we can use either one seamlessly
Vagrant and VirtualBox
  Developers don't always get keys to the cloud
  Virtual machines to the rescue
    Host: 16 GB RAM or more
    3-4 VMs with 3 GB RAM
  Prove out your graph algorithms on a small data set before wasting time on a big data set
Apache Ambari
  Simple install for Apache Hadoop and related Apache big data packages
    HDFS, HBase, ZooKeeper, etc.
  Management and monitoring dashboard
  Enables integration of other software
Hands-On: Gremlin Console
Getting the data
  NPM registry runs on Apache CouchDB
  Replication in Apache CouchDB is awesome
  https://skimdb.npmjs.com/registry
Transform the data
  CouchDB is a document store
  Dependencies are graph data
  Other things can be too
    Users
    Keywords
    License
  Graph model depends on the questions you want to ask of the graph
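As a rough illustration of the document-to-graph step, the sketch below turns one package document into vertex and edge records. The document shape is a simplified assumption (a flat `name`, `license`, `dependencies`, and `devDependencies`), not the full registry format, and `docToGraph` with its id scheme is hypothetical.

```javascript
// Convert one (simplified, assumed) package document into graph elements.
function docToGraph(doc) {
  const packageId = `package:${doc.name}`;
  const vertices = [{ id: packageId, label: 'Package', name: doc.name }];
  const edges = [];

  // A license becomes its own vertex so packages sharing it are connected.
  if (doc.license) {
    const licenseId = `license:${doc.license}`;
    vertices.push({ id: licenseId, label: 'License', name: doc.license });
    edges.push({ label: 'license', from: packageId, to: licenseId });
  }

  // Each dependency name becomes an edge to another package vertex.
  const addDeps = (deps, label) => {
    for (const depName of Object.keys(deps || {})) {
      edges.push({ label, from: packageId, to: `package:${depName}` });
    }
  };
  addDeps(doc.dependencies, 'dependency');
  addDeps(doc.devDependencies, 'devdependency');

  return { vertices, edges };
}

// Hypothetical sample document.
const sampleDoc = {
  name: 'example-app',
  license: 'MIT',
  dependencies: { 'left-pad': '^1.0.0' },
  devDependencies: { 'mocha': '^2.4.5' },
};

const graph = docToGraph(sampleDoc);
console.log(graph.vertices.length, graph.edges.length); // 2 3
```

Users and keywords could be handled the same way as the license: promote the value to a vertex and link the package to it.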
NPM Graph Schema
  [Figure: schema diagram]
  Vertex labels with counts: Person 125K, License 2K, Document 250K, Keyword 81K, Package 1.5M
  Edge labels shown: license, dependency, devdependency
The GraphComputer
Anatomy of a Vertex Program
  Vertex-centric graph logic
  Parallel execution (BSP)
Out-of-the-box Vertex Programs
  Traversal
  BulkLoader
  BulkDumper
  PageRank
  PeerPressure
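To give a feel for vertex-centric logic under bulk synchronous parallel (BSP) execution, here is a simplified, single-threaded simulation of PageRank in plain JavaScript. It is not TinkerPop's VertexProgram interface. The graph, the function name `pageRank`, and the parameter defaults are assumptions, and the sketch assumes every vertex has at least one outgoing edge.

```javascript
// Hypothetical graph: vertex id -> list of outgoing neighbors.
const outEdges = { a: ['b', 'c'], b: ['c'], c: ['a'] };

// Simulate BSP supersteps: send messages, hit a barrier, then update state.
// Assumes every vertex has at least one outgoing edge.
function pageRank(graph, iterations = 20, damping = 0.85) {
  const ids = Object.keys(graph);
  const n = ids.length;
  let rank = Object.fromEntries(ids.map(id => [id, 1 / n]));

  for (let step = 0; step < iterations; step++) {
    // Message phase: each vertex sends an equal share of its rank
    // to every outgoing neighbor. Vertices only see their own state.
    const inbox = Object.fromEntries(ids.map(id => [id, 0]));
    for (const id of ids) {
      const neighbors = graph[id];
      for (const target of neighbors) {
        inbox[target] += rank[id] / neighbors.length;
      }
    }

    // Barrier, then compute phase: each vertex updates its rank
    // using only the messages it received in this superstep.
    rank = Object.fromEntries(
      ids.map(id => [id, (1 - damping) / n + damping * inbox[id]])
    );
  }
  return rank;
}

const ranks = pageRank(outEdges);
console.log(ranks);
```

Because each vertex reads only its own state and inbox, the per-vertex work inside a superstep can run in parallel on a real graph processor. The loop here just runs it sequentially.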
Hands-On: Graph Program
Next stop? More data!
  Graphs are for connecting data!
  Consume data from GitHub
    User data
    Static code analysis
    Code usage analysis
  Consume data from Twitter
    Trending news
    Security alerts
Summary
  Apache TinkerPop is for graph computing
  OLTP vs OLAP is an important distinction
  Gremlin allows you to seamlessly bridge the two
  Graph thinking is different from relational
  Is the future multi-model?
  Many opportunities to innovate in this space
Acknowledgements
  Marko Rodriguez: Gremlin language, Gremlin OLAP
  Ketrina Yim: Illustrator, creator of Gremlin and friends
  Stephen Mallette: TinkerPop release manager, Gremlin applications
  Daniel Kuppitz: Gremlin language guru
  David Robinson: Big data, multi-model architect/developer
Questions?
Thank you!