E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms -- I

Transcription

1 E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms -- I Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center September 18th,

2 Course Structure
Columns: Class Date, Number, Topics Covered
Class dates: 09/04/14, 09/11/14, 09/18/14, 09/25/14, 10/02/14, 10/09/14, 10/16/14, 10/23/14, 10/30/14, 11/06/14, 11/13/14, 11/20/14, 11/27/14, 12/04/14, 12/11/14 & 12/12/14
Topics covered: Introduction to Big Data Analytics; Big Data Analytics Platforms; Big Data Storage and Processing; Big Data Analytics Algorithms -- I; Big Data Analytics Algorithms -- II; Linked Big Data Analysis Graph Computing and Network Science; Big Data Visualization; Big Data Mobile Applications; Large-Scale Machine Learning; Big Data Analytics on Specific Processors; Hardware and Cluster Platforms for Big Data Analytics; Big Data Next Challenges IoT, Cognition, and Beyond; Thanksgiving Holiday; Final Projects Discussion (Optional); Two-Day Big Data Analytics Workshop Final Project Presentations

3 Interest Groups

4 Project List

5 Dataset List

6 Reminder -- Hadoop-related Apache Projects
Ambari: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and the ability to view MapReduce, Pig, and Hive applications visually.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
ZooKeeper: A high-performance coordination service for distributed applications.

7 Mahout Overview -- I

8 Mahout Overview -- II

9 Mahout Overview -- III

10 R

11 Major Open Source Licenses
Apache License, Version 2.0 (January 2004): Apache Software Foundation. Derived works do not need to be open-sourced.
GNU License (GPLv2): Free Software Foundation's General Public License. Derived works need to be open-sourced; copyleft license.

12 Hadoop Integration with R

13 Spark 1.1 and related efforts
Related efforts: Ooyala Job Server; Hive on Spark; Pig on Spark
Spark Streaming (real-time): DStreams, streams of RDDs
GraphX (graph, alpha): RDD-based graphs
MLlib (machine learning): RDD-based matrices
Spark SQL: RDD-based tables
Spark RDD API
HDFS, S3, Cassandra
YARN, Mesos, Standalone
Releases: Spark 1.0(.2): Aug 05, 2014; Spark 1.1(.0): Sept 11,

14 Spark Concepts
Spark focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools.
Figure: Linear regression performance, Spark vs. Hadoop
Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud June

15 Spark Core
History server for Spark UI; integration with YARN security model; unified job submission tool; Java 8 support; internal engine improvements

16 Spark Streaming
Web UI for streaming; graceful shutdown; user-defined input streams; support for creating in Java; refactored API
Stability improvements across the board; Amazon Kinesis support; rate limiting for streams; support for polling Flume streams
Streaming + ML: streaming linear regressions

17 Spark MLlib v1.0 & v1.1
Sparse vector support; decision trees; linear algebra: SVD and PCA
Contributors: 40 (v1.0) -> 68
Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression
Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec
Statistics: sampling (core), correlations, hypothesis testing, random data generation
Performance and scalability: major improvement to decision tree, tree aggregation
Python API: decision tree, statistics, linear methods

18 Performance (v1.0 vs. v1.1)

19 GraphLab
GraphLab is a high-performance, distributed computation framework written in C++. It is an open source project under the Apache License. While GraphLab was originally developed for machine learning tasks, it has found great success at a broad range of other data-mining tasks.

20 GraphLab performance
Graph size vs. machine size: consider storing the topology (in a CRS-like format) of a graph on a server with 1 TB of memory, assuming an average vertex degree of 25:
storage_size = (index_size + 1) + edgelist_size
1 TB = ((#v + 1) + #v * 25) * 8 bytes
#v ~= 5 billion
Scale up & out: if the Hadoop-based solution scales linearly to 18 million machines, it is just equivalent to GraphLab in terms of performance, but the cost is much higher.
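The vertex-count estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch of the slide's formula; the function name `max_vertices` and the choice between a decimal (10^12 bytes) and a binary (2^40 bytes) terabyte are assumptions, since the slide does not say which one it means.

```python
def max_vertices(memory_bytes, avg_degree=25, word_bytes=8):
    """Solve memory = ((v + 1) + v * avg_degree) * word_bytes for v.

    (v + 1) is the length of the CRS-like index array and
    v * avg_degree is the length of the edge list; both are
    assumed to use word_bytes per entry, as on the slide.
    """
    words = memory_bytes // word_bytes
    return (words - 1) // (1 + avg_degree)


decimal_tb = 10**12   # assumption: 1 TB = 10^12 bytes
binary_tb = 2**40     # assumption: 1 TB = 2^40 bytes

print(max_vertices(decimal_tb))  # roughly 4.8 billion vertices
print(max_vertices(binary_tb))   # roughly 5.3 billion vertices
```

Either reading of "1 TB" gives a figure close to the slide's estimate of about 5 billion vertices.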

21 IBM System G

22 An important pillar for Big Data foundations
Diagram labels: UI / User; App Builder; Integration & Governance; Graphs; Streams; Hadoop; Data Explorer; Warehouse; Potential; Connector Framework; CM, RM, DM; RDBMS; Feeds; Web 2.0; Web; CRM, ERP; File Systems

23 Graphs
Diagram labels: Graph Database RDF / Property Graph Attributes Topological Analytics Collective Graph Macro Graphical Models Activity Graph Micro & Reasoning Contextual Analysis Collective Analysis Cognitive Understanding

24 System G Graph Computing Tools
Layers: Visualization, Analytics, Middleware, Database, Hardware
Visualization: Huge Network Visualization; Network Propagation; I2 3D Network Visualization; Geo Network Visualization; Graphical Model Visualization
Analytics: Communities; Graph Search; Network Info Flow; Bayesian Networks; Centralities; Graph Query; Shortest Paths; Latent Net Inference; Ego Net Features; Graph Matching; Graph Sampling; Markov Networks
Legend: System G Assets; Open Source; IBM Product
Other diagram labels: Graph Processing Interface; BigInsights; Hadoop; Pthreads; PERCS Coh. Clus.; GBase (update, scan, operators, indexing); HBase; HDFS; Shared Memory Run Time Library; Graphs; RDMA; Distr. Memory RT Library; MPI; Cluster (BladeCenter, BlueGene); Graph Data Interface; Native Store; DB2 RDF; DB2 Graphs; FPGA/HMC; Infosphere Streams (ISS); TinkerPop Compliant DBs

25 System G Network Science Analytics Tools
Diagram labels: Judgment; Abstract comprehension; Reasoning; Cognitive Networks: Markovian & Bayesian Networks, Deep Learning Tools, Brain Analysis Tool; Observations; Intrinsic senses; Cognitive Analytics: Visual Sentiment Analysis, Emotion Analysis; Spatiotemporal Analytics: Road Network Algorithms, Spatiotemporal Data Mining, Spatiotemporal Indexing; Cognition Layer; Text/Visual Sentiments, Feeling and Emotions; Behavioral Analytics: Anomaly Detection Tools, Recommender Tools; Semantics Layer; Concept Layer; Feature Layer; Sensor Layer; Multi-Modality Multi-Layer Behavior Analysis

26 System G Analytics Overview
System G is a comprehensive set of graph analytics libraries for Big Data.
Analytics: Communities; Graph Search; Network Info Flow; Bayesian Networks; Centralities; Graph Query; Shortest Paths; Latent Net Inference; Ego Net Features; Graph Matching; Graph Sampling; Markov Networks
Analytics targeting big graphs: Create HBase coprocessors to provide scalable graph analytics by distributing data and computation in a balanced way based on graph topology. They outperform MapReduce-based approaches by 2.8 times. Exploit RDMA and other hardware-based optimizations for CPU-intensive analytics to achieve high CPU utilization, low IO, and good speedup.
Analytics targeting dynamic graphs: In some analytics, e.g., dynamic graph clustering coefficients, System G was up to 21x faster than GraphLab. Includes incremental K-core, evolution-aware clustering, and streaming graph clustering coefficients.

27 Analytics Characteristics 1
Analytics | Exact or Approx. | Graph size and dynamics
K-core | Exact | Tested against > 200M edge graph
K-neighborhood | Exact | Scale free
K shortest path | Exact | Scale free
Connected component | Exact | Tested against > 70M edge graph
Pagerank | Exact | Tested against > 70M edge graph
Graph search & recommendation | Exact | Scale free
Probabilistic Inference in Bayesian Networks | Approx. | Constrained by physical memory
XRIME scalable community detection | Approx. | Tested against > 5M edge graph
Dynamic subgraph matching | Exact | Tested against > 150K edge graph. Change up to 30% within 10 sec
Incremental streaming K-core | Exact | Tested against > 16M edge graph
Streaming graph clustering | Approximation | Tested against 400K updates/sec
Streaming graph clustering coefficient | Exact | Tested against > 1.8B edge graph, up to 24M updates/sec

28 Analytics Characteristics 2
Analytics | Run time characteristics | Parallelization
K-core | Disk IO bound | Distributed parallelized
K-neighborhood | Disk IO bound | Distributed
K shortest path | Disk IO bound | Distributed parallelized
Connected component | Disk IO bound | Distributed parallelized
Pagerank | Disk IO bound | Distributed parallelized
Graph search & recommendation | Mixed IO and CPU workload | Multithreading
Probabilistic Inference in Bayesian Networks | O(Nkr^w/P) | Multithreading
XRIME scalable community detection | Initially CPU intensive | Distributed parallelized
Dynamic subgraph matching | Fast indices updating. Indices update time for 30% graph change < 10 seconds | Not parallelized
Incremental streaming K-core | Update time is O(|E|) | Not parallelized
Streaming graph clustering | Update time is O(M^2) | Not parallelized
Streaming graph clustering coefficient | Update time for each insertion/deletion is O(Degree) | Not parallelized

29 Analytics Characteristics 3
Analytics | Memory requirements | Graph format requirements
K-core | O(|V| + |E|) | Edge list, provide format conversion
K-neighborhood | Scale free | Edge list, provide format conversion
K shortest path | O((|E|/|V|)^k) | Edge list, provide format conversion
Connected component | O(|V| + |E|) | Edge list, provide format conversion
Pagerank | O(|V| + |E|) | Edge list, provide format conversion
Graph search and recommendation | O((|E|/|V|)^3) | Adjacency list, provide format conversion
Probabilistic Inference in Bayesian Networks | O(Nkr^w) | Edge list, provide format conversion
XRIME scalable community detection | Scale free | Adjacency list, provide format conversion
Dynamic subgraph matching | 200MB for 150K edge graph | Edge list, provide format conversion
Incremental streaming K-core | O(|V| + |E|) | Edge streams, provide format conversion
Streaming graph clustering | O(|V| + |E|) | Edge streams, provide format conversion
Streaming graph clustering coefficient | O(|V| + |E|) | Edge streams, provide format conversion

30 Graph Analytics Based on GBase
Move computation to data by utilizing HBase coprocessors. Our scheme distributes the computation to where the data live on the backend servers, while existing schemes bring data to the client side for processing with a scheme that
Novel maintenance algorithms to update analytic results to reflect dynamic changes in original input graphs.
Charts: shorter is better.

31 Centralities
Objective: Scalable algorithm to measure node centralities
Algorithms: Eigen centrality (PageRank); degree centralities
Approaches: MapReduce; HBase coprocessor that moves computation to data
Application: Find important nodes (e.g., influencers) in a graph
Papers: M. Canim, Y.-C. Chang, System G: Big, Rich Graph Data Analytics in the Cloud, IEEE IC2E 2013
Performance comparison (chart): MapReduce vs. System G; shorter is better
Our advantage: MapReduce is oblivious to the graph topology, whereas System G takes advantage of it.
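As a small illustration of the two centrality measures named on this slide, the following single-machine sketch computes in-degree centrality and PageRank by power iteration. It is not the MapReduce or HBase coprocessor implementation described in the lecture; the function names, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration.

```python
def in_degree(edges):
    """Count incoming edges per node for a list of (src, dst) pairs."""
    counts = {n: 0 for e in edges for n in e}
    for _, dst in edges:
        counts[dst] += 1
    return counts


def pagerank(edges, damping=0.85, iters=50):
    """Plain power-iteration PageRank on a directed edge list."""
    nodes = sorted({n for e in edges for n in e})
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for w in out[v]:
                    nxt[w] += share
        rank = nxt
    return rank


edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
print(in_degree(edges))
print(max(ranks, key=ranks.get))  # node with the highest PageRank
```

In this toy graph node "c" receives links from both other nodes, so it ranks highest under both measures.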

32 Egonet features and top-k shortest paths
Objective: Discovering egonet features and finding top-k shortest paths
Approach: Scanning local regions with HBase coprocessors to find k-nearest neighbors and the k-egonet of network nodes; BFS-type search between HBase coprocessors while applying cut-off thresholds for newly discovered paths
Application/Use cases: Finding neighbors of important nodes in a large social graph; finding alternative communication paths between nodes
Papers and Patents: Publication: M. Canim, Y.-C. Chang, System G Data Store: Big, Rich Graph Data Analytics in the Cloud, IEEE IC2E. Patent: M. Canim, Y.-C. Chang, Distributed K-Shortest Path Search, filed.
Figure: k-step neighbors induced subgraph, egonet
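The k-step neighborhood and its induced subgraph (the egonet) can be sketched with a breadth-first search that stops expanding at depth k. This is a single-machine illustration only, not the HBase coprocessor scheme from the slide; the adjacency-dictionary input format and the function names are assumptions.

```python
from collections import deque


def k_hop_neighbors(adj, source, k):
    """Return {node: hop distance} for all nodes within k hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # cut-off: do not expand beyond k hops
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def egonet(adj, source, k):
    """Subgraph induced by the k-hop neighborhood of source."""
    members = set(k_hop_neighbors(adj, source, k))
    return {u: sorted(w for w in adj.get(u, ()) if w in members)
            for u in sorted(members)}


adj = {
    "a": ["b", "e"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["a"],
}
print(k_hop_neighbors(adj, "a", 2))
print(egonet(adj, "a", 2))
```

Node "d" is three hops from "a", so it is excluded from the 2-step egonet, and the edge from "c" to "d" is dropped from the induced subgraph.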

33 Communities: K-Core
First horizontally scaling solution for multi-resolution community identification and maintenance. Multi-resolution k-core construction. Incremental maintenance. Best paper award at IEEE BigData

34 K-core Key Observation
Qualified Neighbor Count (QNC): the number of neighbor vertices whose degree is greater than or equal to k
Core number <= QNC <= degree
QNC can easily be computed; it depends only on the neighbors' degrees
Provides a tight bound over the k-core subgraph
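The bound above can be checked on a toy graph. The sketch below computes core numbers with the standard peeling procedure (repeatedly removing a minimum-degree vertex) and compares them with QNC. Two assumptions are made here that the slide does not spell out: QNC for a vertex is evaluated with k set to that vertex's own core number, and the graph is an undirected adjacency dictionary. This is not the distributed algorithm from the lecture.

```python
def core_numbers(adj):
    """Core number of every vertex via simple (quadratic) peeling."""
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=lambda x: degree[x])
        k = max(k, degree[v])      # core level never decreases while peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def qnc(adj, v, k):
    """Qualified Neighbor Count: neighbors of v with degree >= k."""
    return sum(1 for w in adj[v] if len(adj[w]) >= k)


# Triangle a-b-c with a pendant vertex d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
core = core_numbers(adj)
for v in sorted(adj):
    print(v, core[v], qnc(adj, v, core[v]), len(adj[v]))
```

Each printed row lists the vertex, its core number, its QNC, and its degree, so the inequality can be read off left to right.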

35 Network Information Flow
Objective: Analyze, predict, and affect information flow in networks
Algorithms: Edge manipulation to minimize or maximize meme propagation
Approaches: Find the k best edges to delete or add to affect the dominant eigenvalue of the network
Application: Affect the propagation of rumors or counter-messages in social media
Papers: H. Tong et al., Gelling, and Melting, Large Graphs by Edge Manipulation, ACM CIKM 2012, best paper award
Figure: Node Manipulation vs. Edge Manipulation. Green: node deletion. Red: edge deletion.

36 Network Information Flow Advantage
Existing approaches: Understand the tipping point of network information flow; use eigenvalue perturbation to characterize node deletion impact
Our unique contributions: Solution to the edge-based network information flow control problem, which is NP-complete, by exploiting network eigen properties; minimize or maximize the propagation
Our method: Only need eigen computation once; impacts of different edges are decoupled; complexity of O(n^2)
Charts: Minimizing Propagation and Maximizing Propagation; axes: Log (Infected Ratio) vs. Time Ticks
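One way to see why a single eigen computation can rank all edges independently is a first-order perturbation argument: for an undirected graph with leading eigenvector u, removing edge (i, j) changes the dominant eigenvalue by an amount roughly proportional to u[i] * u[j]. The sketch below scores edges that way. The scoring rule is an assumption based on that general perturbation argument, not a formula given on the slide, and the function names and the power-iteration setup are likewise illustrative.

```python
def leading_eigenvector(adj, iters=200):
    """Power iteration on A + I for an undirected graph {node: set(neighbors)}.

    Adding the identity keeps the same eigenvectors and avoids
    oscillation on bipartite graphs.
    """
    vec = {v: 1.0 for v in adj}
    for _ in range(iters):
        nxt = {v: vec[v] + sum(vec[w] for w in adj[v]) for v in adj}
        norm = sum(x * x for x in nxt.values()) ** 0.5
        vec = {v: x / norm for v, x in nxt.items()}
    return vec


def rank_edges_for_deletion(adj, k):
    """Top-k edges by u[i] * u[j]; each edge is scored independently."""
    u = leading_eigenvector(adj)  # computed once, reused for every edge
    edges = {tuple(sorted((a, b))) for a in adj for b in adj[a]}
    ranked = sorted(edges, key=lambda e: (-u[e[0]] * u[e[1]], e))
    return ranked[:k]


# A 4-clique on nodes 0..3 with a tail 3-4-5.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3, 5},
    5: {4},
}
print(rank_edges_for_deletion(adj, 3))
```

Edges inside the dense clique score highest, while the tail edge scores lowest, which matches the intuition that cutting edges in the densest region does the most to slow propagation.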

37 Graph traversal for Recommendation & Visualization
Item-user graph: People who bought this also bought that.
Collaborative filtering ==> 2-hop traversal & ranking
For visualization ==> 4-hop traversal & rankings
IBM KnowledgeView 1-year access log: 72.3K users, 82.1K docs, and 1.74 million downloads
Chart labels: TBD; TBD; Products; Startup; Open Sources; System G
*All performance numbers are preliminary
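The "people who bought this also bought that" query can be read as a 2-hop traversal on the item-user bipartite graph: from an item to its buyers, then from those buyers to their other items, ranked by how often each item is reached. The sketch below uses a small in-memory dictionary; the data, the function name `also_bought`, and the count-based ranking are assumptions for illustration, not the System G implementation.

```python
from collections import Counter


def also_bought(purchases, item, top_n=3):
    """Rank items co-purchased with `item` via a 2-hop traversal.

    purchases maps each user to the set of items they bought.
    """
    scores = Counter()
    for user, items in purchases.items():
        if item in items:                 # hop 1: item -> user
            for other in items:           # hop 2: user -> other items
                if other != item:
                    scores[other] += 1
    # Rank by co-purchase count, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "C", "D"},
    "u4": {"D"},
}
print(also_bought(purchases, "A"))
```

A 4-hop traversal for visualization would repeat the same two hops once more, starting from the items found in the first pass.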

38 Subgraph Matching on Dynamic Graphs
Objective: To find matching subgraphs within a larger time-evolving graph with numerical labels
Algorithm: Graph indexing, filtering, and pruning
Approaches: Indexes graphs with numerical node/edge labels; efficiently processes index updates while keeping pruning speed
Application: Pattern search (clique, communication patterns) on network graphs
Figure: offline index building; online query processing

39 Matching Problem Statement
Figure: pattern graph; data graph; 10 example matches
Given: a large dynamic graph G with numerical node/edge labels and a smaller query graph Q with user-specified numerical node/edge labels (e.g., communication capacity).
Goal: return a set of subgraphs of G, each of which is structurally isomorphic to Q and whose node/edge labels are compatible with Q (e.g., provide enough network capacity).

40 Matching Technique: Gradin
Step 1: offline index building. Create an index of frequently occurring fragments for fast search. Encode subgraphs into multi-dimensional vectors using DFS codes, which are used as keywords to create inverted indices. This enables fast search and updates.
Step 2: online query processing. Upon a subgraph match query, the query is decomposed into fragments and matched against the index. All matches are collected and joined to find suitable candidates.

41 Matching Performance
Implementation: Written in C++. Evaluated with real and synthetic graphs.
Performance comparison: against the multi-dimensional search tree technique (labeled as UpdAll)
Datasets: B3000: BCUBE of 3K nodes; CAIDA: 26K nodes, 106K edges
Chart annotations: 20x improvement; 13x improvement

42 Graphical Model (with Markovian Latent Inference)
Objective: Bayesian inference
Approach: Multithreading; architecture-aware acceleration; work-stealing dynamic task scheduling
Application/Use cases: Anomaly detection
Papers and Patents: Y. Xia, W. S. Lin, and C.-Y. Lin, Efficient Data Injection for Bayesian Network Based Anomaly Detection
Overview picture of the analytics: Highly optimized for multicore/manycore processors; model evidence in a Markov model; allows latent variables

43 Parallel Inference from Bayesian Network to Junction Tree

44 Smarter another Planet

45 Questions?