E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms -- I




Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center
September 18th, 2014

Course Structure
Class Date: 09/04/14, 09/11/14, 09/18/14, 09/25/14, 10/02/14, 10/09/14, 10/16/14, 10/23/14, 10/30/14, 11/06/14, 11/13/14, 11/20/14, 11/27/14, 12/04/14, 12/11/14 & 12/12/14
Number: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14-15
Topics Covered: Introduction to Big Data Analytics; Big Data Analytics Platforms; Big Data Storage and Processing; Big Data Analytics Algorithms -- I; Big Data Analytics Algorithms -- II; Linked Big Data Analysis Graph Computing and Network Science; Big Data Visualization; Big Data Mobile Applications; Large-Scale Machine Learning; Big Data Analytics on Specific Processors; Hardware and Cluster Platforms for Big Data Analytics; Big Data Next Challenges IoT, Cognition, and Beyond; Thanksgiving Holiday; Final Projects Discussion (Optional); Two-Day Big Data Analytics Workshop Final Project Presentations

Interest Groups: http://www.ee.columbia.edu/~cylin/course/bigdata/getgroupinfo.html

Project List: http://www.ee.columbia.edu/~cylin/course/bigdata/getprojectinfo.html

Dataset List: http://www.ee.columbia.edu/~cylin/course/bigdata/getdatasetinfo.html

Reminder -- Hadoop-related Apache Projects
Ambari: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and the ability to view MapReduce, Pig, and Hive applications visually.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
ZooKeeper: A high-performance coordination service for distributed applications.

Mahout Overview -- I

Mahout Overview -- II

Mahout Overview -- III

R

Major Open Source Licenses
Apache License, Version 2.0 (January 2004): Apache Software Foundation. Derived works do not need to be open-sourced.
GNU License (GPLv2): Free Software Foundation's General Public License. Derived works need to be open-sourced; copyleft license.

Hadoop Integration with R

Spark 1.1 and related efforts
Related efforts: Ooyala Job Server, Hive on Spark, Pig on Spark
Spark Streaming (real-time): DStreams, streams of RDDs
GraphX (graph, alpha): RDD-based graphs
MLlib (machine learning): RDD-based matrices
Spark SQL: RDD-based tables
Spark RDD API
HDFS, S3, Cassandra
YARN, Mesos, Standalone
Releases: Spark 1.0(.2): Aug 05, 2014; Spark 1.1(.0): Sept 11, 2014
Source: http://www.meetup.com/spark-users/files/

Spark Concepts
Spark focuses on one class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools.
[Figure: linear regression performance, Spark vs. Hadoop]
Source: Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. Spark: Cluster Computing with Working Sets. HotCloud 2010, June 2010.

Spark Core
History server for Spark UI
Integration with YARN security model
Unified job submission tool
Java 8 support
Internal engine improvements

Spark Streaming
Web UI for streaming
Graceful shutdown
User-defined input streams
Support for creating in Java
Refactored API
Stability improvements across the board
Amazon Kinesis support
Rate limiting for streams
Support for polling Flume streams
Streaming + ML: streaming linear regressions

Spark MLlib v1.0 & v1.1
Sparse vector support
Decision trees
Linear algebra: SVD and PCA
Contributors: 40 (v1.0) -> 68
Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression
Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec
Statistics: sampling (core), correlations, hypothesis testing, random data generation
Performance and scalability: major improvement to decision tree, tree aggregation
Python API: decision tree, statistics, linear methods

Performance (v1.0 vs. v1.1)

GraphLab
GraphLab is a high-performance, distributed computation framework written in C++. It is an open source project under the Apache License. While GraphLab was originally developed for machine learning tasks, it has found great success at a broad range of other data-mining tasks.

GraphLab performance
Graph size vs. machine size: consider storing the topology of a graph (in a CRS-like format) on a server with 1 TB of memory, assuming an average vertex degree of 25:
  storage_size = (index_size + 1) + edgelist_size
  1 TB = ((#v + 1) + #v * 25) * 8 bytes
  #v ~= 5 billion
Scale up & out: if the Hadoop-based solution scales linearly to 18 million machines, it is just equivalent to GraphLab in terms of performance, but the cost is much higher.
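The capacity estimate above can be checked with a short calculation. The following is a minimal sketch, assuming 8-byte entries as on the slide; the slide does not say whether 1 TB means 10^12 or 2^40 bytes, so both are computed. The function names `max_vertices` and `crs_from_edges` are illustrative and do not come from GraphLab.

```python
def max_vertices(memory_bytes, avg_degree, entry_bytes=8):
    """Largest vertex count whose CRS-like topology fits in memory.

    Solves ((v + 1) + v * avg_degree) * entry_bytes <= memory_bytes for v.
    """
    entries = memory_bytes // entry_bytes        # number of 8-byte slots available
    return (entries - 1) // (avg_degree + 1)     # v * (avg_degree + 1) + 1 <= entries


def crs_from_edges(num_vertices, edges):
    """Build CRS arrays (index, edgelist) from directed (src, dst) pairs."""
    index = [0] * (num_vertices + 1)
    for src, _ in edges:
        index[src + 1] += 1                      # count out-edges per vertex
    for v in range(num_vertices):
        index[v + 1] += index[v]                 # prefix sums give row offsets
    edgelist = [0] * len(edges)
    cursor = index[:-1].copy()                   # next free slot per vertex
    for src, dst in edges:
        edgelist[cursor[src]] = dst
        cursor[src] += 1
    return index, edgelist


v_decimal = max_vertices(10**12, 25)             # 1 TB taken as 10^12 bytes
v_binary = max_vertices(2**40, 25)               # 1 TB taken as 2^40 bytes
index, edgelist = crs_from_edges(3, [(0, 1), (0, 2), (2, 0)])
```

Both readings of "1 TB" land near the slide's figure of roughly 5 billion vertices. The tiny CRS example shows where the `(#v + 1)` index entries and the `#v * 25` edge entries come from: one offset per vertex plus a sentinel, and one entry per edge.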

IBM System G: http://systemg.research.ibm.com

An important pillar for Big Data foundations
[Diagram labels: UI / User; App Builder; Integration & Governance; Graphs; Streams; Hadoop; Data Explorer; Warehouse; Potential; Connector Framework; CM, RM, DM; RDBMS; Feeds; Web 2.0; Email; Web; CRM, ERP; File Systems]

Graphs
[Diagram labels: Graph Database (RDF / Property Graph, Attributes); Topological Analytics; Graphical Models & Reasoning; Collective Graph (Macro); Activity Graph (Micro); Contextual Analysis; Collective Analysis; Cognitive Understanding]

System G Graph Computing Tools
Visualization: Huge Network Visualization, Network Propagation, I2 3D Network Visualization, Geo Network Visualization, Graphical Model Visualization
Analytics: Communities, Graph Search, Network Info Flow, Bayesian Networks, Centralities, Graph Query, Shortest Paths, Latent Net Inference, Ego Net Features, Graph Matching, Graph Sampling, Markov Networks
Middleware: Graph Processing Interface
Database: Graph Data Interface
Hardware
Legend: System G Assets, Open Source, IBM Product
[Remaining diagram labels, in slide order: BigInsights; Hadoop; Pthreads; PERCS Coh. Clus.; GBase (update, scan, operators, indexing); HBase; HDFS; Shared Memory Run Time Library; Graphs; RDMA; Distr. Memory RT Library; MPI; Cluster (BladeCenter, BlueGene); Native Store; DB2 RDF; DB2 Graphs; FPGA/HMC; Infosphere Streams (ISS); TinkerPop Compliant DBs]

System G Network Science Analytics Tools
Cognitive Networks: Markovian & Bayesian Networks, Deep Learning Tools, Brain Analysis Tool
Cognitive Analytics: Visual Sentiment Analysis, Emotion Analysis
Spatiotemporal Analytics: Road Network Algorithms, Spatiotemporal Data Mining, Spatiotemporal Indexing
Behavioral Analytics: Anomaly Detection Tools, Recommender Tools
Layers: Cognition Layer, Semantics Layer, Concept Layer, Feature Layer, Sensor Layer
[Other diagram labels: Judgment; Abstract comprehension; Reasoning; Observations; Intrinsic senses; Text/Visual Sentiments, Feeling and Emotions; Multi-Modality Multi-Layer Behavior Analysis]

System G Analytics Overview
System G is a comprehensive set of graph analytics libraries for Big Data.
Analytics: Communities, Graph Search, Network Info Flow, Bayesian Networks, Centralities, Graph Query, Shortest Paths, Latent Net Inference, Ego Net Features, Graph Matching, Graph Sampling, Markov Networks
Analytics targeting big graphs:
  Create HBase coprocessors to provide scalable graph analytics by distributing data and computation in a balanced way based on graph topology. They outperform MapReduce-based approaches by 2.8 times.
  Exploit RDMA and other hardware-based optimizations for CPU-intensive analytics to achieve high CPU utilization, low IO, and good speedup.
Analytics targeting dynamic graphs:
  In some analytics, e.g., dynamic graph clustering coefficients, System G was up to 21x faster than GraphLab.
  Includes incremental K-core, evolution-aware clustering, and streaming graph clustering coefficients.

Analytics Characteristics 1
Analytics | Exact or Approx. | Graph size and dynamics
K-core | Exact | Tested against > 200M edge graph
K-neighborhood | Exact | Scale free
K shortest path | Exact | Scale free
Connected component | Exact | Tested against > 70M edge graph
Pagerank | Exact | Tested against > 70M edge graph
Graph search & recommendation | Exact | Scale free
Probabilistic Inference in Bayesian Networks | Approx. | Constrained by physical memory
XRIME scalable community detection | Approx. | Tested against > 5M edge graph
Dynamic subgraph matching | Exact | Tested against > 150K edge graph. Change up to 30% within 10 sec
Incremental streaming K-core | Exact | Tested against > 16M edge graph
Streaming graph clustering | Approximation | Tested against 400K updates/sec
Streaming graph clustering coefficient | Exact | Tested against > 1.8B edge graph, up to 24M updates/sec

Analytics Characteristics 2
Analytics | Run time characteristics | Parallelization
K-core | Disk IO bound | Distributed parallelized
K-neighborhood | Disk IO bound | Distributed
K shortest path | Disk IO bound | Distributed parallelized
Connected component | Disk IO bound | Distributed parallelized
Pagerank | Disk IO bound | Distributed parallelized
Graph search & recommendation | Mixed IO and CPU workload | Multithreading
Probabilistic Inference in Bayesian Networks | O(Nkr^w/P) | Multithreading
XRIME scalable community detection | Initially CPU intensive | Distributed parallelized
Dynamic subgraph matching | Fast indices updating. Indices update time for 30% graph change < 10 seconds | Not parallelized
Incremental streaming K-core | Update time is O(|E|) | Not parallelized
Streaming graph clustering | Update time is O(M^2) | Not parallelized
Streaming graph clustering coefficient | Update time for each insertion/deletion is O(Degree) | Not parallelized

Analytics Characteristics 3
Analytics | Memory requirements | Graph format requirements
K-core | O(|V| + |E|) | Edge list (format conversion provided)
K-neighborhood | Scale free | Edge list (format conversion provided)
K shortest path | O((|E|/|V|)^k) | Edge list (format conversion provided)
Connected component | O(|V| + |E|) | Edge list (format conversion provided)
Pagerank | O(|V| + |E|) | Edge list (format conversion provided)
Graph search and recommendation | O((|E|/|V|)^3) | Adjacency list (format conversion provided)
Probabilistic Inference in Bayesian Networks | O(Nkr^w) | Edge list (format conversion provided)
XRIME scalable community detection | Scale free | Adjacency list (format conversion provided)
Dynamic subgraph matching | 200MB for 150K edge graph | Edge list (format conversion provided)
Incremental streaming K-core | O(|V| + |E|) | Edge streams (format conversion provided)
Streaming graph clustering | O(|V| + |E|) | Edge streams (format conversion provided)
Streaming graph clustering coefficient | O(|V| + |E|) | Edge streams (format conversion provided)

Graph Analytics Based on GBase
Move computation to data by utilizing HBase coprocessors: our scheme distributes the computation to where the data live on the backend servers, while existing schemes bring data to the client side for processing.
[Chart: shorter is better]
Novel maintenance algorithms update analytic results to reflect dynamic changes in the original input graphs.
[Chart: shorter is better]

Centralities
Objective: Scalable algorithm to measure node centralities
Algorithms: Eigen centrality (PageRank); degree centralities
Approaches: MapReduce; HBase Coprocessor that moves computation to data
Application: Find important nodes (e.g., influencers) in a graph
Papers: M. Canim, Y.-C. Chang, System G Data Store: Big, Rich Graph Data Analytics in the Cloud, IEEE IC2E 2013
[Chart: performance comparison, MapReduce vs. System G; shorter is better]
Our advantage: MapReduce is oblivious to the graph topology, whereas System G takes advantage of it.
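To make the two centrality measures concrete, here is a minimal single-machine sketch of degree centrality and power-iteration PageRank. It is not the System G or HBase coprocessor implementation. The damping factor 0.85, the iteration count, and the toy graph are illustrative assumptions.

```python
def degree_centrality(adj):
    """Out-degree of every node in an adjacency-list graph {node: [out-neighbors]}."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def pagerank(adj, damping=0.85, iterations=50):
    """Power-iteration PageRank; every neighbor must also appear as a key of adj."""
    nodes = list(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not adj[u])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for u in nodes:
            for v in adj[u]:
                new_rank[v] += damping * rank[u] / len(adj[u])
        rank = new_rank
    return rank


# Hypothetical toy graph: "c" receives links from "a", "b", and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
most_important = max(ranks, key=ranks.get)
degrees = degree_centrality(graph)
```

Each iteration touches only a node's own out-edges. This locality is what a "move computation to data" design can exploit, whereas a topology-oblivious job reshuffles all rank contributions on every pass.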

Egonet features and top-k shortest paths
Objective: Discovering egonet features and finding top-k shortest paths
Approach:
  Scanning local regions with HBase Coprocessor to find k-nearest neighbors and k-egonet of network nodes
  BFS type of search between HBase Coprocessors while applying cut-off thresholds for discovered new paths
Application/Use cases:
  Finding neighbors of important nodes in a large social graph
  Finding alternative communication paths between nodes
Papers and Patents:
  Publication: M. Canim, Y.-C. Chang, System G Data Store: Big, Rich Graph Data Analytics in the Cloud, IEEE IC2E 2013.
  Patent: M. Canim, Y.-C. Chang, Distributed K-Shortest Path Search, filed.
[Figure labels: k-step neighbors; induced subgraph; egonet]
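The k-step neighborhood and its induced subgraph (the egonet) can be sketched with a bounded breadth-first search. This is a single-machine illustration under the assumption of an undirected graph stored as a symmetric adjacency dict. It does not reproduce the coprocessor-based scan described above, and the function names are hypothetical.

```python
from collections import deque


def k_step_neighbors(adj, source, k):
    """Nodes reachable from source in at most k hops, mapped to their hop distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:              # cut-off: do not expand beyond k hops
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def egonet(adj, source, k=1):
    """Undirected edges of the subgraph induced by the k-step neighborhood."""
    members = set(k_step_neighbors(adj, source, k))
    return {(u, v) for u in members for v in adj.get(u, ())
            if v in members and u < v}


# Hypothetical toy graph: a triangle 1-2-3 with a tail 2-4-5.
toy = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2, 5], 5: [4]}
one_hop = k_step_neighbors(toy, 1, 1)
ego_edges = egonet(toy, 1, 1)
two_hop = k_step_neighbors(toy, 1, 2)
```

Egonet features such as edge count or density can then be computed from `ego_edges` and the member set.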

Communities: K-Core
First horizontally scaling solution for multi-resolution community identification and maintenance
Multi-resolution k-core construction
Incremental maintenance
Best paper award at IEEE BigData 2013

K-core Key Observation
Qualified Neighbor Count (QNC): number of neighbor vertices whose degree is greater than or equal to k
Core number <= QNC <= degree
QNC can easily be computed; it depends only on the neighbors' degrees
Provides a tight bound over the k-core subgraph
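The bound can be checked on a small graph. The sketch below computes core numbers with the standard peeling procedure and evaluates QNC at k equal to each vertex's own core number. That choice of k is one reading of the slide, and it is an assumption. The code is illustrative and is not the distributed algorithm from the paper.

```python
def core_numbers(adj):
    """Core number of every vertex, by repeatedly peeling a minimum-degree vertex."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)   # vertex with the smallest remaining degree
        k = max(k, degree[v])                # core level never decreases while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def qnc(adj, v, k):
    """Qualified Neighbor Count: neighbors of v whose degree is >= k."""
    return sum(1 for u in adj[v] if len(adj[u]) >= k)


# Hypothetical toy graph: a triangle 1-2-3 with a tail 3-4-5.
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
core = core_numbers(g)
bounds = {v: (core[v], qnc(g, v, core[v]), len(g[v])) for v in g}
```

Each entry of `bounds` is a triple (core number, QNC, degree). Because QNC needs only the degrees of a vertex's neighbors, it can be evaluated locally without first materializing the k-core.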

Network Information Flow
Objective: Analyze, predict, and affect information flow in networks
Algorithms: Edge manipulation to minimize or maximize meme propagation
Approaches: Find the k best edges to delete or add to affect the dominant eigenvalue of the network
Application: Affect the propagation of rumors or counter-messages in social media
Papers: H. Tong et al., Gelling, and Melting, Large Graphs by Edge Manipulation, ACM CIKM 2012 (best paper award)
[Figure: Node Manipulation vs. Edge Manipulation. Green: node deletion; red: edge deletion]

Network Information Flow: Advantage
Existing approaches:
  Understand tipping point of network information flow
Our unique contributions:
  Solution to the edge-based network information flow control problem, which is NP-complete, by exploiting network eigen properties
  Minimize or maximize the propagation
  Use eigenvalue perturbation to characterize node deletion impact
Our method:
  Only needs the eigen computation once
  Impacts of different edges are decoupled
  Complexity of O(n^2)
[Figures: Minimizing Propagation and Maximizing Propagation; y-axis: Log (Infected Ratio); x-axis: Time Ticks]
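One way to read "eigen computation once, edge impacts decoupled" is through first-order eigenvalue perturbation: on an undirected graph with unit-norm dominant eigenvector u, deleting edge (i, j) lowers the dominant eigenvalue by roughly 2 * u[i] * u[j], so edges can be ranked by the product u[i] * u[j]. The sketch below follows that reading. The scoring rule, the shifted power iteration, and the toy graph are assumptions made for illustration, not the authors' implementation.

```python
import math


def leading_eigenpair(adj, iterations=200):
    """Dominant eigenvalue and unit eigenvector of an undirected graph.

    Power iteration is run on A + I so it also converges on bipartite graphs;
    the returned eigenvalue is the Rayleigh quotient with respect to A.
    """
    vec = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: vec[v] + sum(vec[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(x * x for x in nxt.values()))
        vec = {v: x / norm for v, x in nxt.items()}
    value = sum(vec[v] * sum(vec[u] for u in adj[v]) for v in adj)
    return value, vec


def best_edges_to_delete(adj, k):
    """Top-k edges by eigenvector product, using a single eigen computation."""
    _, vec = leading_eigenpair(adj)
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    return sorted(edges, key=lambda e: vec[e[0]] * vec[e[1]], reverse=True)[:k]


def without_edges(adj, edges):
    """Copy of the graph with the given undirected edges removed."""
    removed = {frozenset(e) for e in edges}
    return {v: [u for u in adj[v] if frozenset((u, v)) not in removed] for v in adj}


# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a tail 3-4-5.
net = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3, 5], 5: [4]}
lam_before, _ = leading_eigenpair(net)
chosen = best_edges_to_delete(net, 1)[0]
lam_after, _ = leading_eigenpair(without_edges(net, [chosen]))
lam_tail, _ = leading_eigenpair(without_edges(net, [(4, 5)]))
```

On this toy graph the highest-scoring edge sits inside the dense clique, and removing it lowers the dominant eigenvalue more than removing the peripheral edge (4, 5).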

Graph traversal for Recommendation & Visualization
[Diagram: item-user graph]
"People who bought this also bought that.."
Collaborative filtering ==> 2-hop traversal & ranking
For visualization ==> 4-hop traversal & rankings
Dataset: IBM KnowledgeView 1-year access log: 72.3K users, 82.1K docs, and 1.74 million downloads
[Performance comparison: System G, Products, Startup, Open Sources; values TBD]
*All performance numbers are preliminary
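The "2-hop traversal & ranking" view of collaborative filtering can be sketched as follows: hop from an item to the users who bought it, hop again to the other items those users bought, and rank by count. This is a minimal in-memory illustration. The purchase data and the function name `also_bought` are made up for the example.

```python
from collections import Counter


def also_bought(purchases, item, top_n=3):
    """Rank items co-purchased with `item`: item -> users -> items."""
    # Hop 1: users connected to the item.
    buyers = {user for user, items in purchases.items() if item in items}
    # Hop 2: other items connected to those users, counted for ranking.
    counts = Counter()
    for user in buyers:
        for other in purchases[user]:
            if other != item:
                counts[other] += 1
    return counts.most_common(top_n)


# Hypothetical purchase log: user -> set of items.
purchases = {
    "u1": {"book", "lamp"},
    "u2": {"book", "lamp", "pen"},
    "u3": {"book", "pen"},
    "u4": {"lamp"},
    "u5": {"book", "lamp"},
}
recommendations = also_bought(purchases, "book")
```

A production system would traverse a graph store rather than scan a dict, but the ranking logic is the same.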

Subgraph Matching
Subgraph matching on dynamic graphs: find matching subgraphs within a larger time-evolving graph with numerical labels
Algorithm: Graph indexing, filtering, and pruning
Approaches:
  Indexes graphs with numerical node/edge labels
  Efficiently processes index updates while keeping pruning speed
Application: Pattern search (clique, communication patterns) on network graph
[Diagram: offline index building; online query processing]

Matching Problem Statement
[Figure: pattern graph, data graph, and example matches, with numerical node/edge labels]
Given: a large dynamic graph G with numerical node/edge labels and a smaller query graph Q with user-specified numerical node/edge labels (e.g., communication capacity).
Goal: return a set of subgraphs of G, each of which is structurally isomorphic to Q and whose node/edge labels are compatible with Q (e.g., provide enough network capacity).

Matching Technique: Gradin
Step 1: offline index building
  Create an index of frequently occurring fragments for fast search
  Encode subgraphs into multi-dimensional vectors using DFS codes
  Used as keywords to create inverted indices
  Enables fast search and updates
Step 2: online query processing
  Upon a subgraph match query, the query is decomposed into fragments and matched with the index
  Collect all matches and join them to find suitable candidates

Matching Performance
Implementation: written in C++; evaluated with real and synthetic graphs.
Performance comparison baseline: multi-dimensional search tree technique (labeled as UpdAll)
Datasets: B3000: BCUBE of 3K nodes; CAIDA: 26K nodes, 106K edges
[Charts: 20x improvement; 13x improvement]

Graphical Model (with Markovian Latent Inference)
Objective: Bayesian inference
Approach: Multithreading; architecture-aware acceleration; work-stealing dynamic task stealing
Application/Use cases: Anomaly detection
Papers and Patents: Y. Xia, W.S. Lin, and C.-Y. Lin, Efficient Data Injection for Bayesian Network Based Anomaly Detection
[Figure: overview picture of the analytics]
Highly optimized for multicore/manycore processors
Model evidence in a Markov model
Allows latent variables

Parallel Inference from Bayesian Network to Junction Tree

Smarter another Planet

Questions?