The Power of Relationships

The Power of Relationships: Opportunities and Challenges in Big Data. Intel Labs, Cluster Computing Architecture

Legal Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright 2013 Intel Corporation.

Target knows when you are pregnant.
"How Companies Learn Your Secrets" by Charles Duhigg, NY Times Magazine [Feb. 2012]
As Pole's [Target statistician] computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a pregnancy prediction score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
A Target analyst noted that sometime in the first 20 weeks, pregnant women load up on supplements like calcium, magnesium and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date.
Image source: [NY Times]

Mining Relationships for Recommendations
Customers Who Bought This Item Also Bought What?
Dog Food #1 Dog Food? Milo's Meatball Treats #2 Greenies for Teeth #3 #527 Richell's Pet Pen Callaway Diablo Driver

Graphs are omnipresent!
- Human Brain: 100B neurons, 100T relationships
- Social Network: 1B users, 140B friendships
- Internet: 1 trillion pages, 100s of trillions of links
- e-commerce: millions of products and users
- Online Services: 27M users, 70K movies
- Life Science: large biological cell networks
Big in size and rich in metadata.
Image source: [Wikipedia] [alz.org] [Facebook]

Use of Graphs: Evolution of Graph Applications
- Graph structure mining: shortest path, reachability, PageRank, and subgraph isomorphism
- Structure combined with rich semantic information: pattern mining, ranking and expert finding, and keyword search
- Structure and semantics combined with machine learning: belief propagation and collaborative filtering for recommendations

Expanding the Capabilities of Big Data Analytics (BDA)
Data Parallelism, Graph Parallelism
Simple Analytics, Aggregation Queries, Log Processing, Indexing, Regression, Classification, Collaborative Filtering, Probabilistic Network Analysis, Contextual Predictive Analytics, Graph Mining
+ ? Do we need to augment Hadoop?

A Simple Large-Scale Graph Problem
How many people are pointing to you, and what's their relative importance?
What's the rank of this user? It depends on the rank of who follows her, which in turn depends on the rank of who follows them.
Loops in the graph: iterate!
Graphics source: [Joseph Gonzalez (CMU)]
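The iteration described on this slide can be sketched in a few lines of Python. This is a minimal single-machine illustration, not code from the talk: the damping factor of 0.85, the fixed iteration count, the handling of vertices without out-links, and the names `pagerank` and `follows` are all assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Iteratively rank vertices; a vertex's rank depends on the ranks of its followers."""
    # Collect every vertex that appears as a source or a target.
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        # Every vertex starts each round with the same base share.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_links.get(v, [])
            if targets:
                # Split this vertex's rank evenly among the vertices it points to.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a vertex with no out-links spreads its rank over all vertices.
                share = damping * rank[v] / n
                for t in vertices:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical follow graph: key follows each user in its list.
follows = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["alice", "carol"],
}
ranks = pagerank(follows)
```

Because "alice" and "carol" point at each other, their ranks cannot be computed in one pass; the loop simply repeats until the values settle, which is the "Loops in graph: iterate!" point of the slide.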

PageRank Performance
- Hadoop MapReduce: 13.3 hrs
- GraphLab (native graph computation framework): 14 min
MapReduce is not a good fit for graph-based computation, but graph preprocessing is another story.
Twitter graph: |V| = 41M, |E| = 1.4B. 8-node Intel Sandy Bridge E3-1280 cluster, 16 GB/node, 10GbE, 2x SSDs (550 MB/s each).

MapReduce's Limitations
- Lots of data replication for independence (independent data rows)
- Programmers must reimagine problems; it is not a natural abstraction
- It was not designed for iterative computations, and it stores everything away at each step

Complicating things further
Power-law graphs = highly uneven processing!
- More than 10^6 vertices have one neighbor.
- High-degree vertices: the top 1% of vertices are adjacent to 50% of the edges!
[Figure: Twitter follow graph, |V| = 41M, |E| = 1.4B; number of vertices vs. out-degree]
Image source: [Wikipedia] [cmu.edu/~pegasus]
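The skew claimed here can be measured directly from an edge list. The sketch below is illustrative only: the function name `edge_share_of_top_vertices` and the small synthetic hub-and-ring graph are assumptions, not data from the talk.

```python
from collections import Counter


def edge_share_of_top_vertices(edges, top_fraction=0.01):
    """Return the fraction of edges that touch the highest-degree vertices."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Keep at least one vertex in the "top" set.
    k = max(1, int(len(degree) * top_fraction))
    top = {v for v, _ in degree.most_common(k)}
    touched = sum(1 for u, v in edges if u in top or v in top)
    return touched / len(edges)


# Hypothetical graph: vertex 0 is a hub linked to vertices 1..99,
# and vertices 1..99 are also chained together.
hub_edges = [(0, i) for i in range(1, 100)]
chain_edges = [(i, i + 1) for i in range(1, 99)]
share = edge_share_of_top_vertices(hub_edges + chain_edges)
```

In this toy graph a single vertex out of 100 touches about half of all edges, so whichever machine owns that vertex does about half of the edge work unless the vertex is split.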

GraphLab: Distributed Graph Computation
An open source collaboration with Carlos Guestrin (UW) et al.
Program for this (the graph), run on this (the cluster).
Split high-degree vertices across machines (Machine 1, Machine 2) as a master and a slave copy.

Gather-Apply-Scatter (GAS)
[Figure: Gather, Apply, and Scatter phases for a vertex Y split across Machines 1 to 4 (one master, three mirrors); partial sums Σ1 to Σ4 are added during Gather]
Graphics source: [Joseph Gonzalez (CMU)]
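One way to read the Gather-Apply-Scatter pattern is as a PageRank step in which each machine sums contributions over its own edges and the partial sums are then combined. The sketch below simulates that on one machine; it is not the GraphLab API. The function name `gas_step`, the list-of-edge-lists representation of machines, and the assumption that every vertex has at least one out-edge are all choices made for the example.

```python
from collections import defaultdict


def gas_step(edge_partitions, rank, out_degree, damping=0.85):
    """One PageRank round written as Gather, Apply, Scatter over partitioned edges."""
    n = len(rank)

    # Gather: each "machine" sums contributions over its local edges only.
    partial_sums = []
    for local_edges in edge_partitions:
        local = defaultdict(float)
        for src, dst in local_edges:
            local[dst] += rank[src] / out_degree[src]
        partial_sums.append(local)

    # The partial sums for each vertex are added together (mirrors to master).
    total = defaultdict(float)
    for local in partial_sums:
        for vertex, value in local.items():
            total[vertex] += value

    # Apply: compute the new vertex value from the combined sum.
    new_rank = {v: (1.0 - damping) / n + damping * total[v] for v in rank}

    # Scatter: in a distributed run the new value would be pushed back out;
    # in this sketch returning it plays that role.
    return new_rank


# Hypothetical 3-cycle a -> b -> c -> a, with its edges split over two machines.
partitions = [[("a", "b")], [("b", "c"), ("c", "a")]]
start = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
degrees = {"a": 1, "b": 1, "c": 1}
after = gas_step(partitions, start, degrees)
```

Because addition is commutative and associative, the result does not depend on how edges are assigned to machines, which is what makes it possible to split a high-degree vertex.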

Approaches to Graph Computation
Bulk Synchronous Processing (BSP) Graph-Parallel
- Giraph on Hadoop (inspired by Google Pregel)
- Dryad (Microsoft Research)
- Apache Hama on Hadoop (Twitter)
Asynchronous Graph-Parallel
- Galois (UT Austin) → Edge partitioning
- GraphLab (CMU) → Vertex partitioning
GraphLab has an edge.
[Figure: runtime in seconds (0 to 9000) vs. number of CPUs (1 to 8), BSP vs. Async]
Graphics source: [Joseph Gonzalez (CMU)]

The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Scheduler
- Consistency Model
Graphics source: [Joseph Gonzalez (CMU)]

But GraphLab is only part of the picture

Distributed Graph-Parallel System Graph Storage and Query Value Image source: [Wikipedia]

Key Considerations
How do we construct graphs? (Graph Ingress)
- Scalable ETL
- Easy to program
- Convenient data connectors
How do we compute on graphs? (Graph Compute)
- Scalable full graph computation
- Efficient processing
- Flexible & reliable MLDM support
How do we store and query graphs? (Graph Storage)
- Graph-structured queries
- Low latency at high throughput
- Leverages popular storage models

Graph Construction
Social networking data → feature extraction → graph formation → relationship graph
Graph construction is data-parallel: Hadoop is perfect for graph construction!
Image source: [Wikipedia]

Building Graphs for Practical Apps
Application | Raw Data | Preprocessing | Graph Formation | Add Network Information
Influential Person | Social Networking | Extract User and Relationship | Directed Graph | N/A
Hidden Topic Analysis | XML Docs | Extract Doc & Words | Bipartite (Doc, Words) | Word Frequency or TFIDF
Recommendation System | Activity Logs | Extract User, Item, and Rating | Bipartite (User, Item) | Rating
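As an illustration of the Hidden Topic Analysis row, the sketch below turns a few documents into a bipartite (doc, word) edge list weighted by TF-IDF. It is a single-machine toy, not GraphBuilder code: the function name `build_doc_word_graph`, whitespace tokenization, and the specific weighting (term frequency times log of N over document frequency) are assumptions chosen for the example.

```python
import math
from collections import Counter


def build_doc_word_graph(docs):
    """Return bipartite edges (doc_id, word, tfidf_weight) from a dict of documents."""
    # Preprocessing: extract (doc, word) pairs and count them.
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Graph formation plus network information: one weighted edge per (doc, word).
    n_docs = len(docs)
    edges = []
    for doc_id, counts in term_counts.items():
        total_terms = sum(counts.values())
        for word, count in counts.items():
            tf = count / total_terms
            idf = math.log(n_docs / doc_freq[word])
            edges.append((doc_id, word, tf * idf))
    return edges


# Hypothetical two-document corpus.
corpus = {"d1": "graph graph hadoop", "d2": "graph spark"}
doc_word_edges = build_doc_word_graph(corpus)
weights = {(d, w): x for d, w, x in doc_word_edges}
```

A word that occurs in every document ("graph" here) gets weight zero under this weighting, so a later step could drop those edges to shrink the graph.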

And, in practice and at scale, we must:
- Minimize the use of system resources, like memory, storage, etc.
- Partition the graph to ensure computational effort is load balanced
- Do our best to ensure the graph we generated is the one we intended
Pipeline: Raw Data → Preprocessing → Graph Formation → Add Network Information → Finalize for Parallel Computation
...but the Data Scientist shouldn't be responsible for this domain expertise!

Graph Construction Library: Intel GraphBuilder
- Offloads domain expertise
- Written in Java for convenient integration with Hadoop MapReduce and applications
- Open source code available at: www.01.org/graphbuilder
[Figure: Hidden Topic Analysis, Relative Ranking Analysis; Graph Computation; Graph Abstraction; Data Store]

GraphBuilder Data Flow
- Extract: graph formation from data source(s) (HDFS, DB, XML docs); feature extraction and tabulation
- Transform: apply cleaning and transformation; graph checks and transformation
- Load: prepare for graph analytics; graph compression, partitioning, and serialization
Components: App-Specific Code, GraphBuilder Library

Challenge: Graph Partitioning
- Minimize communication by minimizing the number of machines each vertex spans
- Place about the same number of edges on each machine
[Figure: example graph with vertices A, B, C, and D partitioned across machines 1 and 2]
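The two goals on this slide can be turned into two numbers for any assignment of edges to machines. The sketch below is an illustration under assumed names (`partition_quality`, `assignment`): it reports the average number of machines a vertex spans and the ratio of the busiest machine's edge count to the average.

```python
from collections import defaultdict


def partition_quality(edge_to_machine, num_machines):
    """Return (average machines spanned per vertex, edge-load imbalance)."""
    spans = defaultdict(set)      # vertex -> machines holding one of its edges
    load = [0] * num_machines     # edges placed on each machine
    for (u, v), machine in edge_to_machine.items():
        spans[u].add(machine)
        spans[v].add(machine)
        load[machine] += 1

    # Goal 1: fewer machines per vertex means less communication.
    replication = sum(len(m) for m in spans.values()) / len(spans)
    # Goal 2: 1.0 means every machine holds the same number of edges.
    imbalance = max(load) / (sum(load) / num_machines)
    return replication, imbalance


# Hypothetical 4-vertex square, two edges per machine.
assignment = {
    ("A", "B"): 0,
    ("A", "C"): 0,
    ("B", "D"): 1,
    ("C", "D"): 1,
}
replication, imbalance = partition_quality(assignment, 2)
```

Here A and D live on one machine each while B and C span both, so the average is 1.5 machines per vertex with perfectly even edge load.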

Difficult to Partition
- Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
- Traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]
Vertex View http://inmaps.linkedinlabs.com/

Heuristic-Based Partitioning Strategies
- Random edge placement: edges are placed randomly by each machine
- Greedy edge placement: global coordination for edge placement to minimize the number of machines each vertex spans
- Oblivious greedy placement: a local version of Greedy, without global coordination

Greedy Algorithm
Place edges on machines that already have the vertices on that edge, while ensuring load balancing.
[Figure: example placement of vertices A, B, C, E, F, and H on Machine 1 and Machine 2, with master and slave copies]
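A simplified, single-process version of this greedy rule can be compared against random placement as follows. This is a sketch, not the PowerGraph or GraphBuilder implementation: the per-machine capacity (`slack`), the tie-breaking by load, and the names `greedy_place`, `random_place`, and `replication_factor` are assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def greedy_place(edges, num_machines, slack=1.1):
    """Place each edge where its endpoints already live, subject to a load cap."""
    cap = math.ceil(slack * len(edges) / num_machines)  # assumed balance rule
    spans = defaultdict(set)   # vertex -> machines that already hold it
    load = [0] * num_machines
    placement = {}
    for u, v in edges:
        open_machines = {i for i in range(num_machines) if load[i] < cap}
        # Prefer machines holding both endpoints, then either endpoint, then any.
        for tier in (spans[u] & spans[v], spans[u] | spans[v], open_machines):
            candidates = tier & open_machines
            if candidates:
                break
        machine = min(candidates, key=lambda i: (load[i], i))
        placement[(u, v)] = machine
        spans[u].add(machine)
        spans[v].add(machine)
        load[machine] += 1
    return placement


def random_place(edges, num_machines, seed=0):
    """Baseline: assign every edge to a machine chosen at random."""
    rng = random.Random(seed)
    return {edge: rng.randrange(num_machines) for edge in edges}


def replication_factor(placement):
    """Average number of machines each vertex spans."""
    spans = defaultdict(set)
    for (u, v), machine in placement.items():
        spans[u].add(machine)
        spans[v].add(machine)
    return sum(len(m) for m in spans.values()) / len(spans)


# Hypothetical graph: two separate triangles, placed on two machines.
triangle_edges = [("A", "B"), ("B", "C"), ("A", "C"),
                  ("D", "E"), ("E", "F"), ("D", "F")]
greedy = greedy_place(triangle_edges, 2)
baseline = random_place(triangle_edges, 2)
```

On this input the greedy rule keeps each triangle on its own machine, so no vertex is replicated, while random placement has no such guarantee. The global `spans` and `load` tables stand in for the global coordination that the Greedy strategy requires; an oblivious variant would keep separate tables per machine.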

Performance Effect
[Figure: relative runtime (0 to 1) of PageRank, Collaborative Filtering, and Shortest Path under Random, Oblivious Greedy, and Greedy placement]
Performance is inversely proportional to replication.
*Gonzalez et al., PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs [OSDI 12]

Speed of Graph Construction
Wikipedia graphs (pipeline stages: Extract, Transform, Load):
- Word-Doc graph (|V| = 54M, |E| = 1.4B): 45 min; graph compression 5%; custom plug-in code 130 lines
- Link graph (|V| = 20M, |E| = 128M): 13 min; graph compression 60%; custom plug-in code 100 lines
Hardware: 8-node cluster; 1U dual-CPU (Intel SNB) Amazon-build ZT Systems servers; 64 GB memory; four SATA hard drives; Intel 10G adapter and switch
Software: Apache Hadoop 1.0.1, GraphLab v2.1, GraphBuilder beta

Graph Storage and Query
Existing (No)SQL solutions have limitations:
- Lack of fixed schema and incomplete knowledge of network structure
- Indexing graphs for n-hop search does not scale well
- Traditional approaches to graph query scale super-linearly (e.g., R-join for subgraph matching has O(n^4) complexity [1])
Requires fresh thinking about graph storage:
- Low-latency and high-throughput stores
- New algorithms for fast random access
- Parallel access for distributed computing
[1] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast Graph Pattern Matching. In ICDE 2008.
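For context on the n-hop point, an n-hop query without any index is a breadth-first expansion over adjacency lists. The sketch below is illustrative only; the function name `n_hop_neighbors` and the in-memory dictionary standing in for a graph store are assumptions.

```python
def n_hop_neighbors(adjacency, start, hops):
    """Return all vertices reachable from start in at most `hops` steps."""
    frontier = {start}
    seen = {start}
    for _ in range(hops):
        # Each hop needs one random-access lookup per frontier vertex.
        frontier = {
            w for v in frontier for w in adjacency.get(v, ()) if w not in seen
        }
        seen |= frontier
    return seen - {start}


# Hypothetical chain a -> b -> c -> d.
chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
two_hops = n_hop_neighbors(chain, "a", 2)
```

Every hop issues one adjacency lookup per frontier vertex, and on power-law graphs the frontier can grow very quickly, which is why the slide asks for low-latency stores with fast random and parallel access.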

Intel Science & Technology Centers
Serving as a bridge between commercial and academic research.
Centers shown: Cloud, Big Data
Big Data, five focus areas: Databases & Analytics, Math & Algorithms, Visualization, Architecture, Streaming
Universities shown: Georgia Tech, CMU, UC Berkeley, Princeton, Brown, U Tenn Knoxville, UW Seattle, MIT, UC Santa Barbara, PSU, Stanford

The Collaboration Continues
[Figure: software stack with ML and Analytics Toolkits, Graph-Parallel, Parallel ML, Cluster API, Distributed System, GraphBuilder, Data Parallel, Hadoop MapReduce, Hadoop HDFS and/or Graph DB, Distributed GraphLab, Graph Parallel, Local Store]
Current areas for collaboration:
1. Advance distributed parallel graph database
2. Research GraphLab (GL) fault tolerance and local storage support
3. Advance GraphBuilder (GB) + GL for streaming and time-evolving apps

Summary
Graph technologies enable exciting new Big Data Analytics applications:
- Expands the role of Hadoop
- Requires new frameworks for graph processing
Intel is partnering with academia to solve the right challenges.
Intel Labs is committed to:
- Developing new technologies in this space
- Contributing to the open source community
We would like to hear from you!