The Power of Relationships: Opportunities and Challenges in Big Data
Intel Labs, Cluster Computing Architecture
Legal Notices. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees, and other third parties are not authorized by Intel to use code names in advertising, promotion, or marketing of any product or services, and any such use of Intel's internal code names is at the sole risk of the user. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright 2013 Intel Corporation.
Target knows when you are pregnant. "How Companies Learn Your Secrets" by Charles Duhigg, NY Times Magazine [Feb. 2012]: As Pole's [Target statistician] computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a pregnancy prediction score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy. A Target analyst noted that sometime in the first 20 weeks, pregnant women load up on supplements like calcium, magnesium, and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date. Image source: [NY Times]
Mining Relationships for Recommendations. "Customers Who Bought This Item Also Bought..." (Image: recommendations for dog food, ranked #1 through #3 and #527, including Milo's Meatball Treats, Greenies for Teeth, Richell's Pet Pen, and a Callaway Diablo driver. What?)
Graphs are omnipresent!
- Human Brain: 100B neurons, 100T relationships
- Social Network: 1B users, 140B friendships
- Internet: 1 trillion pages, 100s of trillions of links
- E-commerce: millions of products and users
- Online Services: 27M users, 70K movies
- Life Science: large biological cell networks
Big in size and rich in metadata. Image sources: [Wikipedia] [alz.org] [Facebook]
Use of Graphs: Evolution of Graph Applications
- Graph structure mining: shortest path, reachability, PageRank, and subgraph isomorphism
- Structure combined with rich semantic information: pattern mining, ranking and expert finding, and keyword search
- Structure and semantics combined with machine learning: belief propagation and collaborative filtering for recommendations
Expanding the Capabilities of Big Data Analytics: applications span a spectrum from data parallelism to graph parallelism. Data-parallel workloads include simple analytics, aggregation queries, log processing, indexing, regression, and classification; graph-parallel workloads include collaborative filtering, probabilistic network analysis, contextual predictive analytics, and graph mining. Do we need to augment Hadoop?
A Simple Large-Scale Graph Problem: how many people are pointing to you, and what is their relative importance? The rank of a user depends on the rank of whoever follows her, which in turn depends on the rank of whoever follows them. Loops in the graph mean we must iterate! Graphics source: [Joseph Gonzalez (CMU)]
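The circularity here is exactly what PageRank resolves by iterating to a fixed point. A minimal single-machine sketch of that iteration (illustrative only; the example graph, damping factor, and iteration count are assumptions, and this is not the Hadoop or GraphLab code discussed next):

```java
import java.util.*;

public class PageRankSketch {
    // edges.get(v) = list of vertices that v points to (its out-links)
    public static Map<Integer, Double> pageRank(Map<Integer, List<Integer>> edges,
                                                int iterations, double damping) {
        Set<Integer> vertices = new HashSet<>(edges.keySet());
        for (List<Integer> outs : edges.values()) vertices.addAll(outs);

        Map<Integer, Double> rank = new HashMap<>();
        for (int v : vertices) rank.put(v, 1.0 / vertices.size());

        for (int iter = 0; iter < iterations; iter++) {
            Map<Integer, Double> next = new HashMap<>();
            for (int v : vertices) next.put(v, (1.0 - damping) / vertices.size());
            // Each vertex spreads its current rank evenly over its out-links, so a
            // vertex's new rank depends on the ranks of whoever points to it.
            for (Map.Entry<Integer, List<Integer>> e : edges.entrySet()) {
                double share = damping * rank.get(e.getKey()) / e.getValue().size();
                for (int dst : e.getValue()) next.merge(dst, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> follows = new HashMap<>();
        follows.put(1, Arrays.asList(2));       // user 1 follows user 2
        follows.put(2, Arrays.asList(3));       // user 2 follows user 3
        follows.put(3, Arrays.asList(1, 2));    // user 3 follows users 1 and 2
        System.out.println(pageRank(follows, 20, 0.85));
    }
}
```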
PageRank Performance: Hadoop MapReduce takes 13.3 hrs; GraphLab (a native graph computation framework) takes 14 min. MapReduce is not a good fit for graph-based computation, but graph preprocessing is another story. Twitter graph: |V| = 41M, |E| = 1.4B. Hardware: 8-node Intel Sandy Bridge E3-1280 cluster, 16 GB/node, 10GbE, 2x SSDs (550 MB/s each).
MapReduce's Limitations: it requires lots of data replication to keep rows independent, and programmers must reimagine their problems as independent data rows, which is not a natural abstraction for graphs. Moreover, it was not designed for iterative computations and materializes everything to storage at each step.
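To make the iteration problem concrete, here is a hedged sketch of how one PageRank iteration might be expressed as a Hadoop MapReduce job with an external driver loop; the line format, class names, and paths are illustrative assumptions, not GraphBuilder or GraphLab code. The structural point is that every iteration is a full job that re-reads and re-writes the graph state on HDFS.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// One PageRank iteration per MapReduce job. Line format: "vertex\trank\tdst1,dst2,..."
public class PageRankMR {

    public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String vertex = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] outs = parts.length > 2 && !parts[2].isEmpty()
                    ? parts[2].split(",") : new String[0];
            // Re-emit the adjacency list so the graph structure survives the iteration.
            ctx.write(new Text(vertex), new Text("ADJ\t" + (parts.length > 2 ? parts[2] : "")));
            for (String dst : outs) {
                ctx.write(new Text(dst), new Text("RANK\t" + (rank / outs.length)));
            }
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text vertex, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            String adj = "";
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if (parts[0].equals("ADJ")) adj = parts.length > 1 ? parts[1] : "";
                else sum += Double.parseDouble(parts[1]);
            }
            double newRank = 0.15 + 0.85 * sum;   // damping factor 0.85, unnormalized sketch
            ctx.write(vertex, new Text(newRank + "\t" + adj));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = args[0];                           // initial ranks on HDFS
        for (int i = 1; i <= Integer.parseInt(args[1]); i++) {
            String output = args[0] + "_iter" + i;
            Job job = new Job(conf, "pagerank-iter-" + i);
            job.setJarByClass(PageRankMR.class);
            job.setMapperClass(RankMapper.class);
            job.setReducerClass(RankReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(input));
            FileOutputFormat.setOutputPath(job, new Path(output));
            // Each pass re-reads and re-writes the entire graph state on HDFS;
            // this materialization, not the rank arithmetic, dominates the runtime.
            if (!job.waitForCompletion(true)) System.exit(1);
            input = output;
        }
    }
}
```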
Complicating things further: power-law degree distributions. In the Twitter follow graph (|V| ≈ 41M, |E| ≈ 1.4B), more than 10^6 vertices have only one neighbor, while the top 1% of vertices, the high-degree ones, are adjacent to 50% of the edges! (Chart: number of vertices vs. out-degree.) Power-law graphs = highly uneven processing! Image source: [Wikipedia] [cmu.edu/~pegasus]
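One way to see this skew for yourself is to histogram out-degrees straight from an edge list; on a power-law graph the counts fall off roughly as a straight line on a log-log plot. A small sketch (the whitespace-separated edge-list file format is an assumption):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;

public class DegreeHistogram {
    public static void main(String[] args) throws Exception {
        // Assumes a whitespace-separated edge list: "srcId dstId" per line.
        Map<Long, Long> outDegree = new TreeMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                long src = Long.parseLong(line.trim().split("\\s+")[0]);
                outDegree.merge(src, 1L, Long::sum);
            }
        }
        // Histogram: how many vertices have each out-degree.
        Map<Long, Long> histogram = new TreeMap<>();
        for (long d : outDegree.values()) histogram.merge(d, 1L, Long::sum);
        histogram.forEach((degree, count) -> System.out.println(degree + "\t" + count));
    }
}
```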
GraphLab: Distributed Graph Computation. An open source collaboration with Carlos Guestrin (UW) et al. (Diagram: "program for this, run on that"; a high-degree vertex is split across Machine 1 and Machine 2, with a master copy on one machine and a slave/mirror copy on the other.)
Gather-Apply-Scatter (GAS). (Diagram: the neighbors of a high-degree vertex are spread across Machines 1-4. In the gather phase each machine computes a partial sum Σ1..Σ4 over its local edges; the partial sums are combined at the master in the apply phase to update the vertex value Y; the scatter phase pushes the new value back out along edges to the mirrors.) Graphics source: [Joseph Gonzalez (CMU)]
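Expressed against this diagram, PageRank becomes a small vertex program: gather computes per-in-edge contributions (combined into the partial sums Σ1..Σ4 on each machine), apply updates the master copy, and scatter notifies out-neighbors. The interface below is an illustrative sketch of the abstraction, not the actual GraphLab API:

```java
// Illustrative Gather-Apply-Scatter interfaces; not the real GraphLab API.
interface GasVertexProgram<V, E, G> {
    G gather(V src, E edge, V dst);          // run per in-edge; results are combined
    G combine(G left, G right);              // associative merge of partial gathers (per machine)
    V apply(V vertex, G gathered);           // update the master copy of the vertex
    boolean scatter(V vertex, E edge);       // signal out-neighbors; return true to activate them
}

// PageRank expressed in the GAS abstraction. Vertex value = current rank,
// edge value = 1/outDegree of the source (its share weight).
class PageRankGas implements GasVertexProgram<Double, Double, Double> {
    private static final double DAMPING = 0.85;

    public Double gather(Double srcRank, Double edgeWeight, Double dstRank) {
        return srcRank * edgeWeight;                  // contribution along one in-edge
    }
    public Double combine(Double left, Double right) {
        return left + right;                          // partial sums can be formed per machine
    }
    public Double apply(Double oldRank, Double gathered) {
        return (1.0 - DAMPING) + DAMPING * gathered;  // new rank at the master
    }
    public boolean scatter(Double newRank, Double edgeWeight) {
        return true;                                  // wake up out-neighbors to recompute
    }
}
```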
Approaches to Graph Computation
Bulk Synchronous Processing (BSP), graph-parallel:
- Giraph on Hadoop (inspired by Google Pregel)
- Dryad (Microsoft Research)
- Apache Hama on Hadoop (Twitter)
Asynchronous, graph-parallel:
- Galois (UT Austin): edge partitioning
- GraphLab (CMU): vertex partitioning
GraphLab has an edge. (Chart: runtime in seconds vs. number of CPUs, 1-8, for BSP vs. asynchronous execution.) Graphics source: [Joseph Gonzalez (CMU)]
The GraphLab Framework: a graph-based data representation, update functions (user computation), a scheduler, and a consistency model. Graphics source: [Joseph Gonzalez (CMU)]
But GraphLab is only part of the picture
(Diagram: value comes from combining a distributed graph-parallel system with graph storage and query.) Image source: [Wikipedia]
Key Considerations
- Graph Ingress: How do we construct graphs? Requires scalable ETL, easy programming, and convenient data connectors.
- Graph Compute: How do we compute on graphs? Requires scalable full-graph computation, efficient processing, and flexible, reliable MLDM (machine learning and data mining) support.
- Graph Storage: How do we store and query graphs? Requires graph-structured queries, low latency at high throughput, and leveraging popular storage models.
Graph Construction. (Diagram: social networking data → feature extraction → graph formation → relationship graph; construction itself is data-parallel.) Hadoop is perfect for graph construction! Image source: [Wikipedia]
Building Graphs for Practical Apps (raw data → preprocessing → graph formation → added network information):
- Influential-person analysis: social networking data → extract users and relationships → directed graph → N/A
- Hidden topic analysis: XML docs → extract docs and words → bipartite (doc, word) graph → word frequency or TF-IDF
- Recommendation system: activity logs → extract user, item, and rating → bipartite (user, item) graph → rating
A sketch of the doc-word case follows below.
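The hidden-topic row reduces to a data-parallel pass over documents: tokenize each document and emit weighted (doc, word) edges. A minimal sketch of that extraction step (plain Java for clarity; it does not use the GraphBuilder API, and the tokenization rule and document id are simplifying assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// Forms weighted bipartite (doc, word) edges from raw text, as in the
// "hidden topic analysis" row above. In a real pipeline each map task would
// run this over its shard of documents and a reduce step would merge counts.
public class DocWordEdges {
    // Returns word -> count for one document; each entry is one weighted edge doc -> word.
    static Map<String, Integer> extractEdges(String docText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : docText.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String docId = "doc42";   // illustrative document id
        String text = "Graphs are omnipresent. Graphs are big and rich in metadata.";
        extractEdges(text).forEach((word, count) ->
                System.out.println(docId + "\t" + word + "\t" + count));
    }
}
```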
And, in practice and at scale, we must extend that pipeline (raw data → preprocessing → graph formation → add network information → finalize for parallel computation):
- Minimize the use of system resources such as memory and storage
- Partition the graph so the computational effort is load balanced
- Do our best to ensure the graph we generate is the one we intended
...but the data scientist shouldn't be responsible for this domain expertise!
Graph Construction Library: Intel GraphBuilder. Offloads the domain expertise above; written in Java for convenient integration with Hadoop MapReduce and applications. (Diagram: applications such as hidden topic analysis and relative-ranking analysis sit on top of graph computation; GraphBuilder provides the graph abstraction between MapReduce and the data store.) Open source code available at: www.01.org/graphbuilder
GraphBuilder Data Flow (ETL):
- Extract: graph formation from data source(s) (HDFS, DB, XML docs) via app-specific feature extraction and tabulation code
- Transform: apply cleaning and transformation (graph checks and transformations)
- Load: prepare for graph analytics (graph compression, partitioning, and serialization), handled by the GraphBuilder library
A sketch of the cleaning step follows below.
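The graph-checks stage is largely mechanical hygiene before compression and partitioning: dropping self-loops, merging duplicate edges, and the like. A minimal sketch of such a cleaning pass, assuming a simple (src, dst, weight) edge representation rather than GraphBuilder's internal one:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EdgeCleaner {
    // Drop self-loops and merge duplicate (src, dst) edges by summing their weights.
    static Map<String, Double> clean(List<String[]> edges) {
        Map<String, Double> merged = new HashMap<>();   // key = "src\tdst", value = weight
        for (String[] e : edges) {
            String src = e[0].trim(), dst = e[1].trim();
            double weight = Double.parseDouble(e[2]);
            if (src.equals(dst)) continue;              // self-loop: skip
            merged.merge(src + "\t" + dst, weight, Double::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String[]> raw = List.of(
                new String[]{"u1", "u2", "1"},
                new String[]{"u1", "u2", "1"},          // duplicate edge: weights merged
                new String[]{"u3", "u3", "1"});         // self-loop: dropped
        clean(raw).forEach((edge, w) -> System.out.println(edge + "\t" + w));
    }
}
```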
Challenge: Graph Partitioning. Minimize communication by minimizing the number of machines each vertex spans, while placing about the same number of edges on each machine. (Diagram: a small example graph with vertices A-D annotated with per-vertex span counts.)
Difficult to Partition. Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04], and traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]. (Image: vertex view, http://inmaps.linkedinlabs.com/)
Heuristic-Based Partitioning Strategies
- Random edge placement: each machine places its edges randomly
- Greedy edge placement: global coordination of edge placement to minimize the number of machines each vertex spans
- Oblivious greedy placement: a local version of greedy, without global coordination
Greedy Algorithm: place each edge on a machine that already holds that edge's vertices, while ensuring load balancing. (Diagram: Machines 1 and 2 each hold some vertices; vertex B has a master copy on one machine and a slave copy on the other, and new edges touching B are steered to those machines.) A sketch follows below.
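A hedged sketch of this heuristic: for each edge, prefer machines that already hold one of its endpoints, breaking ties toward the least-loaded machine. The scoring rule below is a simplification of the PowerGraph formulation, not the exact GraphLab implementation:

```java
import java.util.*;

public class GreedyEdgePlacement {
    // vertexMachines.get(v) = machines that already hold a replica of vertex v
    static int placeEdge(long src, long dst,
                         Map<Long, Set<Integer>> vertexMachines, long[] edgesPerMachine) {
        Set<Integer> srcM = vertexMachines.getOrDefault(src, Collections.emptySet());
        Set<Integer> dstM = vertexMachines.getOrDefault(dst, Collections.emptySet());

        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int m = 0; m < edgesPerMachine.length; m++) {
            // Reward machines that already hold src and/or dst (fewer new replicas),
            // and penalize heavily loaded machines to keep edge counts balanced.
            double score = (srcM.contains(m) ? 1 : 0) + (dstM.contains(m) ? 1 : 0)
                    - (double) edgesPerMachine[m] / (maxLoad(edgesPerMachine) + 1);
            if (score > bestScore) { bestScore = score; best = m; }
        }
        edgesPerMachine[best]++;
        vertexMachines.computeIfAbsent(src, k -> new HashSet<>()).add(best);
        vertexMachines.computeIfAbsent(dst, k -> new HashSet<>()).add(best);
        return best;
    }

    private static long maxLoad(long[] loads) {
        long max = 0;
        for (long l : loads) max = Math.max(max, l);
        return max;
    }

    public static void main(String[] args) {
        long[] load = new long[4];                       // four machines
        Map<Long, Set<Integer>> replicas = new HashMap<>();
        long[][] edges = {{1, 2}, {2, 3}, {1, 3}, {4, 1}, {2, 4}};
        for (long[] e : edges)
            System.out.println(e[0] + "->" + e[1] + " placed on machine "
                    + placeEdge(e[0], e[1], replicas, load));
        // Replication factor = average number of machines spanned per vertex.
        double spans = replicas.values().stream().mapToInt(Set::size).sum();
        System.out.println("replication factor = " + spans / replicas.size());
    }
}
```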
Performance Effect. (Chart: relative runtime of random, oblivious greedy, and greedy partitioning for PageRank, collaborative filtering, and shortest path.) Performance is inversely proportional to replication. *Gonzalez et al., "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs," OSDI '12.
Speed of Graph Construction: Wikipedia graphs. Word-doc graph: 45 min (|V| ≈ 54M, |E| ≈ 1.4B). Link graph: 13 min (|V| ≈ 20M, |E| ≈ 128M). (Chart: time split across extract, transform, load, and graph compression; compression accounts for roughly 60% of the link-graph run and 5% of the word-doc run.) Custom plug-in code: ~100 lines for the link graph, ~130 lines for the word-doc graph. Hardware: 8-node cluster, 1U dual-CPU (Intel Sandy Bridge) ZT Systems servers, 64 GB memory, four SATA hard drives, Intel 10G adapter and switch. Software: Apache Hadoop 1.0.1, GraphLab v2.1, GraphBuilder beta.
Graph Storage and Query. Existing (No)SQL solutions have limitations:
- Lack of a fixed schema and incomplete knowledge of the network structure
- Indexing graphs for n-hop search does not scale well
- Traditional approaches to graph query have super-linear cost (e.g., R-join for subgraph matching has O(n^4) complexity [1])
This requires fresh thinking about graph storage:
- Low-latency and high-throughput stores
- New algorithms for fast random access
- Parallel access for distributed computing
[1] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. "Fast Graph Pattern Matching." ICDE 2008.
Intel Science & Technology Centers: serving as a bridge between commercial and academic research. Two centers shown, Cloud and Big Data; partner universities include Georgia Tech, CMU, UC Berkeley, Princeton, Brown, U Tenn Knoxville, UW Seattle, MIT, UC Santa Barbara, PSU, and Stanford. The Big Data center has five focus areas: databases & analytics, math & algorithms, visualization, architecture, and streaming.
The Collaboration Continues. (Stack diagram: ML and analytics toolkits on top of a parallel ML cluster API; beneath it, a data-parallel path with GraphBuilder on Hadoop MapReduce, and a graph-parallel path with distributed GraphLab, both over Hadoop HDFS, a graph DB, and/or a local store.)
Current areas for collaboration:
1. Advance a distributed parallel graph database
2. Research GraphLab fault tolerance and local storage support
3. Advance GraphBuilder + GraphLab for streaming and time-evolving applications
Summary
Graph technologies enable exciting new Big Data analytics applications:
- They expand the role of Hadoop
- They require new frameworks for graph processing
Intel is partnering with academia to solve the right challenges. Intel Labs is committed to:
- Developing new technologies in this space
- Contributing to the open source community
We would like to hear from you!