Apache Giraph! start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Similar documents

Large Scale Graph Processing with Apache Giraph

Internals of Hadoop Application Framework and Distributed File System

Hadoop IST 734 SS CHUNG

Getting to know Apache Hadoop

Chapter 7. Using Hadoop Cluster and MapReduce

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Big Data With Hadoop

Extreme Computing. Hadoop MapReduce in more detail.

CloudStack and Big Data. Sebastien May 22nd 2013 LinuxTag, Berlin

CS54100: Database Systems

Introduction to Big Data Training

Peers Techno log ies Pv t. L td. HADOOP

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Introduction to Cloud Computing

Big Data and Scripting Systems beyond Hadoop

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Evaluating partitioning of big graphs

CSE-E5430 Scalable Cloud Computing Lecture 2

ITG Software Engineering

CS 378 Big Data Programming

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Map-Reduce and Hadoop

Hadoop implementation of MapReduce computational model. Ján Vaňo

Big Data Analytics. Lucas Rego Drumond

Architectures for massive data management

L1: Introduction to Hadoop

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Introduction to DISC and Hadoop

HiBench Introduction. Carson Wang Software & Services Group

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

COURSE CONTENT Big Data and Hadoop Training

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

BIG DATA IN BUSINESS ENVIRONMENT

Professional Hadoop Solutions

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop Parallel Data Processing

Hadoop in Social Network Analysis - overview on tools and some best practices - Headline Goes Here

HBase Schema Design. NoSQL Ma4ers, Cologne, April Lars George Director EMEA Services

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Big Data and Scripting Systems build on top of Hadoop

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

TRAINING PROGRAM ON BIGDATA/HADOOP

University of Maryland. Tuesday, February 2, 2010

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Introduction to MapReduce and Hadoop

How To Use Hadoop

Introduction to Hadoop

Apache Bigtop: 100% Apache Bigdata management distribution. (and so much more!)

How To Handle Big Data With A Data Scientist

Apache Hadoop: The Big Data Refinery

Big Data and Apache Hadoop s MapReduce

Leveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Hadoop and Map-reduce computing

Map Reduce & Hadoop Recommended Text:

Managing large clusters resources

How Companies are! Using Spark

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Graph Processing with Apache TinkerPop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Community Driven Apache Hadoop. Apache Hadoop Basics. May Hortonworks Inc.

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Lecture 10 - Functional programming: Hadoop and MapReduce

Apache Hama Design Document v0.6

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

MapReduce with Apache Hadoop Analysing Big Data

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Certified Big Data and Apache Hadoop Developer VS-1221

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Gerrit and Jenkins for Big Data Continuous Delivery. Santa Clara, CA, September 2-3

NoSQL and Hadoop Technologies On Oracle Cloud

How To Scale Out Of A Nosql Database

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Apache Flink. Fast and Reliable Large-Scale Data Processing

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Workshop on Hadoop with Big Data

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Transcription:

Apache Giraph! start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Who s this guy?

Roman Shaposhnik Sr. Manager at Pivotal Inc. building a team of ASF contributors ASF junkie VP of Apache Incubator, former VP of Apache Bigtop Hadoop/Sqoop/Giraph committer contributor across the Hadoop ecosystem) Used to be root@cloudera Used to be a PHB at Yahoo! Used to be a UNIX hacker at Sun microsystems

Giraph in action (MEAP) http://manning.com/martella/

What s this all about?

Agenda A brief history of Hadoop-based bigdata management Extracting graph relationships from unstructured data A case for iterative and explorative workloads Bulk Sequential Processing to the rescue Apache Giraph: a Hadoop-based BSP graph analysis framework Giraph application development Demos! Code! Lots of it!

On day one Doug created HDFS/MR

Google papers GFS (file system) distributed replicated non-posix MapReduce (computational framework) distributed batch-oriented (long jobs; final results) data-gravity aware designed for embarrassingly parallel algorithms

One size doesn t fit all Key-value approach map is how we get the keys shuffle is how we sort the keys reduce is how we get to see all the values for a key Pipeline approach Intermediate results in a pipeline need to be flushed to HDFS A very particular API for working with your data

It s not about the size of your data;! it s about what you do with it!

Graph relationships Entities in your data: tuples customer data product data interaction data Connection between entities: graphs social network or my customers clustering of customers vs. products

Challenges Data is dynamic No way of doing schema on write Combinatorial explosion of datasets Relationships grow exponentially Algorithms become explorative iterative

Graph databases Plenty available Neo4J, Titan, etc. Benefits Tightly integrate systems with few moving parts High performance on known data sets Shortcomings Don t integrate with HDFS Combine storage and computational layers A sea of APIs

Enter Apache Giraph

Key insights Keep state in memory for as long as needed Leverage HDFS as a repository for unstructured data Allow for maximum parallelism (shared nothing) Allow for arbitrary communications Leverage BSP approach

Bulk Sequential Processing local processing barrier #1 barrier #2 communications barrier #3 time

BSP applied to graphs @rhatr @TheASF @c0sin Think like a vertex: I know my local state I know my neighbours I can send messages to vertices I can declare that I am done I can mutate graph topology

Bulk Sequential Processing message delivery individual vertex processing & sending messages superstep #1 all vertices are done superstep #2 time

Giraph Hello World public class GiraphHelloWorld extends BasicComputation<IntWritable, IntWritable, NullWritable, NullWritable> { public void compute(vertex<intwritable, IntWritable, NullWritable> vertex, Iterable<NullWritable> messages) { System.out.println( Hello world from the: + vertex.getid() + : ); for (Edge<IntWritable, NullWritable> e : vertex.getedges()) { System.out.println( + e.gettargetvertexid()); } System.out.println( ); } }

Mighty four of Giraph API BasicComputation<IntWritable, // VertexID -- vertex ref IntWritable, // VertexData -- a vertex datum NullWritable, // EdgeData -- an edge label datum NullWritable>// MessageData -- message payload

On circles and arrows You don t even need a graph to begin with! Well, ok you need at least one node Dynamic extraction of relationships EdgeInputFormat VetexInputFormat Full integration with Hadoop ecosystem HBase/Accumulo, Gora, Hive/HCatalog

Anatomy of Giraph run HDFS mappers reducers HDFS 1 @TheASF 2 @rhatr 3 @c0sin OutputFormat 3 1 2 InputFormat 1 2 1 3 1 2 1 3 3 1 2

Anatomy of Giraph run mappers or YARN containers 1 2 @rhatr 3 @c0sin OutputFormat InputFormat @TheASF

A vertex view MessageData1 VertexData MessageData2 VertexID

Turning Twitter into Facebook TheASF @TheASF @rhatr rhatr @c0sin c0sin

Ping thy neighbours public void compute(vertex<text, DoubleWritable, DoubleWritable> vertex, Iterable<Text> ms ){ if (getsuperstep() == 0) { sendmessagetoalledges(vertex, vertex.getid()); } else { for (Text m : ms) { if (vertex.getedgevalue(m) == null) { vertex.addedge(edgefactory.create(m, SYNTHETIC_EDGE)); } } } vertex.votetohalt(); }

Demo time!

But I don t have a cluster! Hadoop in pseudo-distributed mode All Hadoop services on the same host (different JVMs) Hadoop-as-a-Service Amazon s EMR, etc. Hadoop in local mode

Prerequisites Apache Hadoop 1.2.1 Apache Giraph 1.1.0-SNAPSHOT Apache Maven 3.x JDK 7+

Setting things up $ curl hadoop.tar.gz tar xzvf $ git clone git://git.apache.org/giraph.git ; cd giraph $ mvn Phadoop_1 package $ tar xzvf *dist*/*.tar.gz!! $ export HADOOP_HOME=/Users/shapor/dist/hadoop-1.2.1 $ export GIRAPH_HOME=/Users/shapor/dist/ $ export HADOOP_CONF_DIR=$GIRAPH_HOME/conf $ PATH=$HADOOP_HOME/bin:$GIRAPH_HOME/bin:$PATH

Setting project up (maven) <dependencies> <dependency> <groupid>org.apache.giraph</groupid> <artifactid>giraph-core</artifactid> <version>1.1.0-snapshot</version> </dependency> <dependency> <groupid>org.apache.hadoop</groupid> <artifactid>hadoop-core</artifactid> <version>1.2.1</version> </dependency> </dependencies>

Running it $ mvn package $ giraph target/*.jar giraph.giraphhelloworld \ -vip src/main/resources/1 \ -vif org.apache.giraph.io.formats.intintnulltextinputformat \ -w 1 \ -ca giraph.splitmasterworker=false,giraph.loglevel=error

Testing it public void testnumberofvertices() throws Exception { GiraphConfiguration conf = new GiraphConfiguration(); conf.setcomputationclass(giraphhelloworld.class); conf.setvertexinputformatclass(textdoubledoubleadjacencylistvertexinputformat.class); Iterable<String> results = InternalVertexRunner.run(conf, graphseed); }

Simplified view mappers or YARN containers 1 2 @rhatr 3 @c0sin OutputFormat InputFormat @TheASF

Master and master compute Active Master 1 compute() @TheASF Zookeeper 2 @rhatr 3 @c0sin Standby Master

Master compute Runs before slave compute() Has a global view A place for aggregator manipulation

Aggregators Shared variables Each vertex can push values to an aggregator Master compute has full control 1 @TheASF 1 Active Master min: sum: 1 3 1 2 @rhatr 2 2 3 @c0sin

Questions?