GraphBuilder Scalable Graph Construction using Apache TM Hadoop TM

Similar documents
The Power of Relationships

Large-Scale Data Processing

Big Graph Processing: Some Background

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service

Machine Learning over Big Data

Analysis of Web Archives. Vinay Goel Senior Data Engineer

CSE-E5430 Scalable Cloud Computing Lecture 2

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Evaluating partitioning of big graphs

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service. Eddie Dong, Tao Hong, Xiaowei Yang

Fast, Low-Overhead Encryption for Apache Hadoop*

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

Dell Reference Configuration for Hortonworks Data Platform

Intel Media SDK Library Distribution and Dispatching Process

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Intel and Qihoo 360 Internet Portal Datacenter - Big Data Storage Optimization Case Study

The Transition to PCI Express* for Client SSDs

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra

Real-Time Big Data Analytics SAP HANA with the Intel Distribution for Apache Hadoop software

Intel Service Assurance Administrator. Product Overview

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

MapReduce and Lustre * : Running Hadoop * in a High Performance Computing Environment

Search and Real-Time Analytics on Big Data

Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

HIGH PERFORMANCE BIG DATA ANALYTICS

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Benefits of Intel Matrix Storage Technology

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Platforms

Benchmarking Cloud Storage through a Standard Approach Wang, Yaguang Intel Corporation

Intel Desktop Board DP55WB

Big Data and Natural Language: Extracting Insight From Text

HPC & Big Data THE TIME HAS COME FOR A SCALABLE FRAMEWORK

Chapter 7. Using Hadoop Cluster and MapReduce

Intel Desktop Board D945GCPE

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Intel RAID SSD Cache Controller RCS25ZB040

Enhanced Diagnostics Improve Performance, Configurability, and Usability

Big Data and Apache Hadoop s MapReduce

Intel Desktop Board DG41BI

Intel SSD 520 Series Specification Update

iscsi Quick-Connect Guide for Red Hat Linux

Intel Desktop Board D945GCPE Specification Update

Deploying Hadoop with Manager

Intel Desktop Board DG31GL

Intel Desktop Board DG41TY

Outline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013

Intel Desktop Board DG41WV

Information Processing, Big Data, and the Cloud

Oracle Big Data Spatial and Graph

Intel Desktop Board DG31PR

Intel Ethernet and Configuring Single Root I/O Virtualization (SR-IOV) on Microsoft* Windows* Server 2012 Hyper-V. Technical Brief v1.

This guide explains how to install an Intel Solid-State Drive (Intel SSD) in a SATA-based desktop or notebook computer.

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Intel Desktop Board DQ965GF

COMP9321 Web Application Engineering

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Distributed Computing and Big Data: Hadoop and MapReduce

Intel Desktop Board DP43BF

VNF & Performance: A practical approach

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Intel Desktop Board DG43RK

Intel Platform and Big Data: Making big data work for you.

Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

High Availability Server Clustering Solutions

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Intel Solid-State Drive Pro 2500 Series Opal* Compatibility Guide

Maximizing Hadoop Performance with Hardware Compression

Scaling Networking Solutions for IoT Challenges and Opportunities

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Apache Flink Next-gen data analysis. Kostas

Graph Database Proof of Concept Report

Creating Overlay Networks Using Intel Ethernet Converged Network Adapters

Significantly Speed up real world big data Applications using Apache Spark

Intel Identity Protection Technology Enabling improved user-friendly strong authentication in VASCO's latest generation solutions

Systems and Algorithms for Big Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Intel Integrated Native Developer Experience (INDE): IDE Integration for Android*

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

Graph Processing and Social Networks

System Image Recovery* Training Foils

DELL s Oracle Database Advisor

Intel Desktop Board DQ35JO

Large Scale Text Analysis Using the Map/Reduce

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Video Encoding on Intel Atom Processor E38XX Series using Intel EMGD and GStreamer

Intel Matrix Storage Manager 8.x

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Big Data and Analytics: Challenges and Opportunities

Transcription:

GraphBuilder Scalable Graph Construction using Apache TM Hadoop TM Ted Frank Jay Acknowledgements: Carlos Guestrin et al. (CMU/UW) (Collaboration through Intel Science and Technology Center) Intel Labs

Data Analytics and Machine Learning Framework 2

Dénes König Perhaps even more than to the link between mankind and nature, graph theory owes to the link of human beings between each other. English translation of Vielleicht noch mehr als der Berührung der Menschheit mit der Natur verdankt die Graphentheorie der Berührung der Menschen untereinander. From 1 st Graph Theory Book (1936) Image source: [Wikipedia]

C A D Bridges of Königsberg B Frank Harary s bibliography Wordle Social links 1736 1850 1936 1969 1985+ 21 st Century Leonhard Euler Francis Guthrie William Rowan Hamilton Dénes König Frank Harary Computing Revolution Internet Image source: [Wikipedia] [msclub.info] [divisbyzero.com] [Wordle.com] [nature.com]

Natural graphs Graphs derived from natural phenomena.

Don t have idealized structure You Me

http://inmaps.linkedinlabs.com/ Vertex view Don t have regularized structure Over lapping communities

Natural graphs follow preferential attachments They grow with time Rich get Richer Image source: [Wikipedia]

Number of Vertices Power-Law Degree Distribution More than 10 6 vertices have one neighbor. Top 1% of vertices are High-Degree adjacent to Vertices 50% of the edges! Twitter Follow Graph V 41M, E 1.4B Out Degree Image source: [Wikipedia] [cmu.edu/~pegasus]

Graphs are omnipresent! 100B Neuron 100T Relationships 1B Users 140B Friendships 1Trillion Pages 100s T Links Human Brain Social Network Internet Millions of Products & Users 27M Users 70K Movies Large Biological Cell Networks e-commerce Online Services Science Big in size and rich in metadata Image source: [Wikipedia][alz.org] [Facebook]

Graphs are Essential to Data Mining and Machine Learning Identify influential people and information Find communities Understand people s shared interests Model complex data dependencies 11

Identifying influential people Construct a graph Feature Extraction Graph Formation Social Networking Data Graph Construction Data-Parallel Graph Image source: [Wikipedia]

Need a Model (Algorithm) PageRank: [Page et al. 1998]

How many people are pointing to you and what s their relative importance? Depends on rank of who follows them Depends on rank of who follows her What s the rank of this user? Rank? Loops in graph - Must iterate! Graphics source: [Joseph Gonzalez (CMU)]

Properties of Graph-Structured Computation Dependency Graph Local Update Iterative Computations Similar properties for many other problems! Graphics source: [Joseph Gonzalez (CMU)]

How do we program graph computation? Think like a Vertex Malewicz et al. [SIGMOD 10]

The Graph-Parallel Abstraction A user-defined Vertex-Program runs on each vertex Graph constrains interaction along edges Using messages Through shared state Parallelism: Run multiple vertex programs simultaneously Graphics source: [Joseph Gonzalez (CMU)]

Distributed Graph Analytics System Data Graph Construction (Feature Extraction, Graph formation) Structured Machine Learning or Data Mining (Identify influential person) Value Graph Ingress mostly data-parallel Graph-Structured Computation graph-parallel Efficient design requires balanced utilization Image source: [Wikipedia]

System-Level Challenges Too Big to fit (in system memory) 1B User, 140B relationship Balanced cut (Power-law graphs are difficult to cut) Work imbalance (Execution on vertex is proportional to degree of vertex) Image source: [Facebook]

Difficult to Partition Power-Law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04] http://inmaps.linkedinlabs.com/

Edge cut: Partitioning Approaches Impact on system performance Y Vertex cut: Machine 1 Machine 2 Y Must synchronize many edges Any edge cut can directly construct a vertex cut which requires strictly less communications and storage. [Gonzalez et al. 2012] Percolation theory suggests that power law graphs have good vertex cuts. [Albert et al. 2000] Y Y Must synchronize a single vertex Machine 1 Machine 2 Graphics source: [Joseph Gonzalez (CMU)]

Distributed Graph Analytics Environment Program For This Run on This Machine 1 Machine 2 Master Slave Split High-Degree vertices Graphics source: [Joseph Gonzalez (CMU)]

Graph-Structured Computational Pregel - Malewicz et al. [PODC 09, SIGMOD 10] Frameworks Apache Giraph 2011 Carlos Guestrin et al. [UAI 10, OSDI 12] Others: Kineograph, Stanford GPS, Dryad, BoostPGL, Pegasus, Microsoft Trinity, and Signal-Collect

PageRank Performance Hadoop 13.3 hrs GraphLab 14 min 57x Not a natural fit for Graph-Parallel Abstraction Must store graph s state after every iteration Twitter Graph V 41M E 1.4 Billion 8-node Intel Sandy Bridge E3-1280 Cluster, 16GB/node, 10GbE, 2x SSDs (550 MB/s each)

Graph Construction

I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis. Most of the time I m lucky if I get to do any analysis at all. Anonymous Data Scientist from Jeff Heer s (Stanford) interview study, 2012

Building Graphs for Practical Apps Raw Data Preprocessing Graph Formation Add Network Information Influential Person Social Networking Extract User and Relationship Directed Graph N/A Hidden Topic analysis XML Docs Extract Doc & Words Bipartite (Doc, Words) Word Frequency or TFIDF Recommendation System Activity Logs Extract User Item and Rating Bipartite (User, item) Rating

And, in practice and at scale we must: Raw Data Preprocessing Graph Formation Add Network Information Finalize for Parallel Computation Minimize the use of system resources, like memory, storage, etc. Natural Graph partitioning to ensure computational effort is load balanced Do our best to ensure the graph we generated is the one we intended to but the Data Scientist shouldn t be responsible for this domain expertise!

Graph Abstraction Library: GraphBuilder Hidden Topic Analysis Relative Ranking Analysis Graph Computation Graph Abstraction Offloads domain expertise Written in Java for easy use in Hadoop MapReduce and applications Data Store Completes Graph Analytics pipeline

GraphBuilder Data flow Extract Transform Load Graph formation from data source(s) Apply cleaning and transformation Prepare for graph analytics HDFS DB XML Docs Feature Extraction Tabulation Graph Checks and Transformation Graph Compression, Partitioning, and Serialization App-Specific Code GraphBuilder Library

Extract - Graph Formation Extract features from data to construct relationship (RecordReader) (GraphTokenizer) V Optional: Reduce ( Y ) f(x) conf.set(xmlinputformat.start_tag_key, START_TAG); conf.set(xmlinputformat.end_tag_key, END_TAG); new XMLRecordReader((FileSplit) split, conf); Read vertex object Document doc = builder.parse(new InputSource(new StringReader(s))); title = xpath.evaluate("//page/title/text()", doc); title = title.replaceall("\\s", "_"); id = xpath.evaluate("//page/id/text()", doc); String text = xpath.evaluate("//page/revision/text/text()", doc); parselinks(text); Parser Feature Extraction Write simple data-specific functions. Program sequential, not parallel!

Extract - Tabulation Built-in tabulation functions for TF, TFIDF, WC, ADD, MUL, DIV. Interface for custom tabulation on source and/or target vertex Example: Term Frequency Cat Cat D1 Rat User Defined: Reduce ( Y ) f(x) D1 Rat Mat Net Bat Apply (f(x)) Y Mat Net Bat

Transform Graph Transforms & Checks Would like the ability to: Optionally filter duplicate, dangling and/or self edges Transform a directed graph into an undirected graph Calculate graph statistics, compute sub-graphs, etc. The library provides: Functions to perform self-, dangling- and duplicate-edge removal Directionality transformation Solutions are based on a distributed hashing algorithm M1 A C B D H(A, B) H(C, D) A B Detector M2 B D A C H(A, B) H(C, D) Detector Steering function

Load - Graph Compression We can save memory if we normalize it (e.g., compress Link graph by 60%) But, seems to call for a global lookup in a framework that prefers independent subproblems A simple, scalable solution is to shard ordered lists: Dictionary (Aaron,0) (AMD,4) (Brad,1) (CMU,2) (Dan,5) (Dave,3) (IBM,6) (Intel,7) Dictionary Shard 1 (Aaron,0) (AMD,4) (Brad,1) (CMU,2) Dictionary Shard 2 (Dan,5) (Dave,3) (IBM,6) (Intel,7) Unconverted Edge List (Aaron,IBM) (Brad,Intel) (Dan,AMD) (Dave,CMU) (Source Sorted) M1 M2 (AMD,5) (CMU,3) (IBM,0) (Intel,1) (Dest Sorted) Converted Edge List (5,4) (3,2) (0,6) (1,7)

Load - Graph Partitioning Minimize communications by minimizing the number of machines vertex spans D C 1 1 2 Place about the same number of edges on each machine A 1 2 B Traditional graph-partitioning algorithms perform poorly on Power-Law Graphs. [Abou-Rjeili et al. 06]

Heuristic-Based Partitioning Strategies Random edge placement Edges are placed randomly by each system Greedy edge placement Global coordination for edge placement to minimizes the vertex spanned Oblivious greedy placement implements a local version of the Greedy without global coordination

Greedy Algorithm Place edges on machines which already have the vertices on that edge while ensuring balance loading. A Master B Slave B C H F Machine 1 Machine 2 B E

Smaller is better # of Vertex copies Partitioning Quality Twitter Graph: 41M vertices, 1.4B edges 18 16 14 12 10 8 6 4 2 8 16 24 32 40 48 56 64 Number of Machines Greedy yields a quality cut, but what is the effect on performance? *Gonzalez et al., PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, [OSDI 12]

Relative Runtime Performance Effect 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Random Oblivious Greedy Greedy PageRank Collaborative Filtering Shortest Path Performance is inversely proportional to replication. *Gonzalez et al., PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, [OSDI 12]

Load - Graph Serialization Partitioning JSON Encoding Edge Lists Vertex Lists Self-describing data format JSON +/- compression Extensible Easy to connect with Graph Databases Plug-in Graph Visualizers { } src_id : 34, dest_id : 45 e-data : 30 { } ver_id : 34, v-data : 56, mirror : [1,2,3], owner : 1

GraphBuilder Software stack Built-in Parser/Tabulator Custom Parser/Tabulator Extract Transform Load Hadoop/Map-Reduce Distributed Graph Hadoop/HDFS Linux Cluster Services (Amazon AWS) Private Linux Cluster

Speed of Graph Construction Wikipedia Graphs Word-Doc Graph 45 min V 54M, E 1.4B Link Graph 13 min V 20M, E 128M Execution time α O( V ) Extract Transform Load Graph Compression Custom plug-in code Link 60% 100 lines Word-Doc 5% 130 lines Hardware: 8 node cluster Software: 1U Dual CPU (Intel SNB) Amazon build ZT systems Apache Hadoop 1.0.1 64 GB Memory, Four SATA Hard Drives GraphLab v2.1 Intel 10G Adapter and Switch GraphBuilder beta

Summary Graphs are essential for structured ML and DM High-performing Graph-Analytics pipelines requires careful system design GraphBuilder solves the Graph Analytics ingress challenge

Going forward Available soon (mid-nov)! Intel Open-Source Portal http://www.01.org GraphLab is available at http://graphlab.org Apache 2 license Interested in collaboration Would like to hear from you! (Office hours: Thursday 10:50am) Intel Booth (#27) Real time Analytics, Hadoop Benchmarking www.intel.com/bigdata

Legal Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright 2012 Intel Corporation.