Machine Learning over Big Data

Machine Learning over Big Data. Presented by Fuhao Zou (fuhao@hust.edu.cn), June 16, 2014, Huazhong University of Science and Technology

Contents: 1. Role of Machine Learning 2. Challenge of Big Data Analysis 3. Distributed Machine Learning 4. Conclusion

Big Data Already Happened: 1 billion tweets per week; 750 million Facebook users; 6 billion Flickr photos; 48 hours of video uploaded every minute.

The 4 Vs of Big Data (characteristics): 1. Volume 2. Variety 3. Velocity 4. Value

Knowledge: "If a tree falls in a forest and no one is around to hear it, does it make a sound?" (George Berkeley). Nobody knows what is in all this data unless it is processed and analyzed.

Why Do Machine Learning on Big Data? Machine learning incorporates data analyses covering predictive analytics, data mining, pattern recognition, and multivariate statistics. Basic analytics: counts and averages, preprocessing, SQL queries; human scale, hand crafted. Advanced analytics (machine learning): discovering patterns, making predictions, multivariate queries; machine scale, data driven. Unfortunately, traditional analytics tools are not well suited to capturing the value hidden in Big Data.

Challenge #1: Massive Data Scale. About 1B nodes do not fit into the main memory of a single machine. A familiar problem!

Challenge #2: Gigantic Model Size. More than $10^{11}$ parameters do not fit into the main memory of a single machine either!

Challenge #3: Huge Cognitive Space. 1M to 1B categories are seen in modern extreme classification problems. kNN? SVM?

Distributed ML Is a Necessity. [Diagram labels] Large-scale application: problem definition, feature engineering, parameter tuning. Learning methods and theory: learning methods, efficient algorithms, learning theory. Distributed machine learning: programming models, parallel computing, distributed systems. Distributed storage and computing: resource management.

Categories of Distributed ML: 1. Data-Parallel ML 2. Graph-Parallel ML

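Data-Parallel ML. Linear regression with gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ (for every $j$) }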

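Map-reduce and data parallelism. Batch gradient descent with $m = 400$: $\theta_j := \theta_j - \alpha \frac{1}{400} \sum_{i=1}^{400} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. Machine 1: use $(x^{(1)}, y^{(1)}), \ldots, (x^{(100)}, y^{(100)})$ to compute $\mathrm{temp}_j^{(1)} = \sum_{i=1}^{100} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. Machines 2, 3, and 4: use examples 101 to 200, 201 to 300, and 301 to 400 to compute $\mathrm{temp}_j^{(2)}$, $\mathrm{temp}_j^{(3)}$, and $\mathrm{temp}_j^{(4)}$ in the same way. Combine: $\theta_j := \theta_j - \alpha \frac{1}{400} (\mathrm{temp}_j^{(1)} + \mathrm{temp}_j^{(2)} + \mathrm{temp}_j^{(3)} + \mathrm{temp}_j^{(4)})$.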
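The split-and-combine scheme on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's code: the function names, the toy data set (y = 1 + 2x), the learning rate, and the sequential loop standing in for four machines are all assumptions.

```python
# Sketch: data-parallel batch gradient descent for linear regression.
# Function names, toy data, and the learning rate are illustrative assumptions.

def predict(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))

def partial_gradient(theta, shard):
    """Map step: one machine sums (h(x) - y) * x_j over its own shard."""
    temp = [0.0] * len(theta)
    for x, y in shard:
        err = predict(theta, x) - y
        for j, xj in enumerate(x):
            temp[j] += err * xj
    return temp

def gradient_step(theta, shards, alpha):
    """Reduce step: add the per-machine sums, then update every theta_j."""
    m = sum(len(s) for s in shards)
    temps = [partial_gradient(theta, s) for s in shards]  # could run in parallel
    total = [sum(col) for col in zip(*temps)]
    return [t - alpha * g / m for t, g in zip(theta, total)]

# Toy data: y = 1 + 2*x, with x[0] = 1 as the intercept feature.
data = [([1.0, i / 400.0], 1.0 + 2.0 * (i / 400.0)) for i in range(400)]
shards = [data[k:k + 100] for k in range(0, 400, 100)]  # 4 "machines", 100 examples each

theta = [0.0, 0.0]
for _ in range(1000):
    theta = gradient_step(theta, shards, alpha=0.5)
```

Because the gradient is a sum over examples, summing per-shard partial sums gives the same update as a single pass over all 400 examples; only the order of the additions changes.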

Map-reduce. [Figure: the training set is split across Computers 1 to 4, and their results are combined.]

Graph-Parallel ML

Social Arithmetic: Label Propagation. Example: my interests are 50% what I list on my profile (50% cameras, 50% biking) + 40% what Sue Ann likes (80% cameras, 20% biking) + 10% what Carlos likes (30% cameras, 70% biking), so I like 60% cameras, 40% biking. Recurrence algorithm: $Likes[i] = \sum_{j \in Friends[i]} W_{ij} \times Likes[j]$; iterate until convergence. Parallelism: compute all $Likes[i]$ in parallel.
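The arithmetic in this example can be checked with a short sketch of one step of the recurrence. The dictionary layout and the function name are assumptions. The assignment of percentages to people (profile 50/50, Sue Ann 80/20, Carlos 30/70) is the reading of the slide that reproduces its stated 60/40 result.

```python
# Sketch: one step of the label-propagation recurrence for a single vertex ("me").
# Data layout and names are illustrative assumptions.

profile = {"cameras": 0.5, "biking": 0.5}          # what I list on my profile
likes = {
    "sue_ann": {"cameras": 0.8, "biking": 0.2},
    "carlos": {"cameras": 0.3, "biking": 0.7},
}
weights = {"profile": 0.5, "sue_ann": 0.4, "carlos": 0.1}  # the W_ij values

def propagate(profile, likes, weights):
    """Weighted average of my profile and my friends' current interests."""
    result = {}
    for topic in profile:
        total = weights["profile"] * profile[topic]
        for friend, interests in likes.items():
            total += weights[friend] * interests[topic]
        result[topic] = total
    return result

me = propagate(profile, likes, weights)
```

In the full algorithm the same step runs for every vertex, using the neighbors' latest estimates, and repeats until the estimates stop changing.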

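PageRank (Centrality Measures). Iterate: $R[i] = \alpha + (1 - \alpha) \sum_{j \in N[i]} \frac{1}{L[j]} R[j]$. Where: $\alpha$ is the random reset probability, $N[i]$ is the set of pages that link to page $i$, and $L[j]$ is the number of links on page $j$.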
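A minimal sketch of this iteration follows, assuming the unnormalized form in which each rank is the reset probability plus the damped sum of incoming contributions. The three-page graph, the value of alpha, and the iteration count are illustrative assumptions. The sketch also assumes every page has at least one outgoing link and that every link target is itself a key of the dictionary.

```python
# Sketch: synchronous PageRank iteration on a tiny hypothetical graph.

def pagerank(out_links, alpha=0.15, iterations=50):
    """out_links maps each page to the list of pages it links to."""
    pages = list(out_links)
    in_links = {p: [] for p in pages}
    for j, targets in out_links.items():
        for i in targets:
            in_links[i].append(j)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Every new rank is computed from the previous iteration's ranks.
        rank = {
            i: alpha + (1 - alpha) * sum(rank[j] / len(out_links[j]) for j in in_links[i])
            for i in pages
        }
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Each page's new rank depends only on its in-neighbors, so all pages can be updated in parallel within one iteration.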

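Matrix Factorization: Alternating Least Squares (ALS). The Netflix ratings matrix (users x movies) is approximated by the product of user factors (U) and movie factors (M). As a graph: user vertices $u_1, u_2$ hold user factors $u_i$, movie vertices $m_1, m_2, m_3$ hold movie factors $m_j$, and edges carry ratings ($r_{11}, r_{12}, r_{23}, r_{24}$). Update function computes: $u_i = \arg\min_{w} \sum_{j \in N[i]} (r_{ij} - w^{\top} m_j)^2$.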
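To make the alternating updates concrete, here is a sketch restricted to a single latent factor, where each least-squares solve has a closed form. The rank-1 restriction, the absence of regularization, the toy ratings, and all names are assumptions made to keep the example dependency-free; a practical system would use higher-rank factors and a linear solver.

```python
# Sketch: rank-1 alternating least squares on a few observed ratings.
# With one latent factor, the least-squares solve for u_i reduces to
# u_i = sum_j r_ij * m_j / sum_j m_j^2 (and symmetrically for m_j).

def squared_error(ratings, u, m):
    return sum((r - u[i] * m[j]) ** 2 for (i, j), r in ratings.items())

def als_rank1(ratings, n_users, n_movies, sweeps=20):
    u = [1.0] * n_users
    m = [1.0] * n_movies
    history = [squared_error(ratings, u, m)]
    for _ in range(sweeps):
        # Fix movie factors, solve for each user factor independently.
        for i in range(n_users):
            num = sum(r * m[j] for (a, j), r in ratings.items() if a == i)
            den = sum(m[j] ** 2 for (a, j), r in ratings.items() if a == i)
            u[i] = num / den
        # Fix user factors, solve for each movie factor independently.
        for j in range(n_movies):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = sum(u[i] ** 2 for (i, b), r in ratings.items() if b == j)
            m[j] = num / den
        history.append(squared_error(ratings, u, m))
    return u, m, history

# Hypothetical observed ratings, keyed by (user index, movie index).
ratings = {(0, 0): 3.0, (0, 1): 1.0, (1, 1): 2.0, (1, 2): 4.0}
u, m, history = als_rank1(ratings, n_users=2, n_movies=3)
```

Each half-sweep solves an exact least-squares problem, so the squared error recorded in `history` never increases. Within a half-sweep, the per-vertex solves are independent of one another, which is the source of parallelism.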

Other Examples. Statistical inference in relational models: belief propagation, Gibbs sampling. Network analysis: centrality measures, triangle counting. Natural language processing: CoEM, topic modeling.

Bulk Synchronous Parallel. [Figure: across iterations, CPUs 1 to 3 each compute, then wait at a barrier before the next iteration begins.]

Pregel: Bulk Synchronous Parallel. Each iteration: compute, communicate, barrier.
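The compute, communicate, barrier cycle can be illustrated with a single-process simulation. The sketch below propagates the maximum value through a graph, a common teaching example for this model. It is not Pregel's or Giraph's actual API: the function name, the message layout, and the toy graph are assumptions.

```python
# Sketch: single-process simulation of bulk synchronous parallel supersteps.
# Goal: every vertex ends up holding the maximum value found in the graph.

def bsp_max(neighbors, initial):
    value = dict(initial)
    changed = set(neighbors)   # vertices that need to send in this superstep
    supersteps = 0
    while changed:
        # Communicate: vertices whose value changed send it to their neighbors.
        inbox = {v: [] for v in neighbors}
        for v in changed:
            for n in neighbors[v]:
                inbox[n].append(value[v])
        # Barrier: no vertex computes until every message has been delivered.
        # Compute: each vertex looks only at its own inbox and its own value.
        changed = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > value[v]:
                value[v] = max(msgs)
                changed.add(v)
        supersteps += 1
    return value, supersteps

# Hypothetical path graph a - b - c - d.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = bsp_max(neighbors, {"a": 3, "b": 6, "c": 2, "d": 1})
```

The computation halts when a superstep produces no changes, so no vertex has anything left to send.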

Open Source Implementations. Pregel: https://plus.google.com/+pregel/ Giraph: http://incubator.apache.org/giraph/ Golden Orb: http://goldenorbos.org/

Curse of the Slow Job. [Figure: across iterations, CPUs 1 to 3 stop at every barrier, so all of them wait for the slowest job.]

Tradeoffs of the BSP Model. Pros: graph parallel; relatively easy to build; deterministic execution. Cons: doesn't exploit the graph structure; can lead to inefficient systems.

Asynchronous Computation Model. Threads proceed to the next iteration without waiting. An asynchronous variant: GraphLab (http://graphlab.org/)

What is GraphLab?

The GraphLab Framework: graph-based data representation; update functions (user computation); scheduler; consistency model.

Data Graph. A graph with arbitrary data (C++ objects) associated with each vertex and edge. Graph: social network. Vertex data: user profile text, current interest estimates. Edge data: similarity weights.

New Perspective on Partitioning. Natural graphs have poor edge separators: classic graph partitioning tools (e.g., ParMetis, Zoltan) fail, and CPU 1 and CPU 2 must synchronize many edges. Natural graphs have good vertex separators: CPU 1 and CPU 2 must synchronize only a single vertex.
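A star graph, one high-degree hub with many leaves, is a small stand-in for the skewed degree distribution of natural graphs and shows the contrast directly. The sketch below counts what must be synchronized under each strategy. The graph, the round-robin placement, and the function names are illustrative assumptions.

```python
# Sketch: edge cut versus vertex cut on a hypothetical star graph, two machines.

def cut_edges(edges, vertex_owner):
    """Edge separator: vertices are placed on machines; count edges that cross."""
    return sum(1 for u, v in edges if vertex_owner[u] != vertex_owner[v])

def spanned_vertices(edges, edge_owner):
    """Vertex separator: edges are placed on machines; count vertices that span several."""
    machines = {}
    for edge in edges:
        for v in edge:
            machines.setdefault(v, set()).add(edge_owner[edge])
    return sum(1 for ms in machines.values() if len(ms) > 1)

leaves = ["leaf%d" % i for i in range(100)]
edges = [("hub", leaf) for leaf in leaves]

# Strategy 1: place vertices round-robin (hub on machine 0).
vertex_owner = {"hub": 0}
vertex_owner.update({leaf: i % 2 for i, leaf in enumerate(leaves)})

# Strategy 2: place edges round-robin; only the hub is replicated.
edge_owner = {edge: i % 2 for i, edge in enumerate(edges)}

crossing = cut_edges(edges, vertex_owner)
replicated = spanned_vertices(edges, edge_owner)
```

With 100 leaves, half of the edges cross machines under vertex placement, while edge placement leaves a single vertex (the hub) to keep in sync.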

Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.
pagerank(i, scope) {
  // Get neighborhood data
  (R[i], W_ij, R[j]) <- scope;
  // Update the vertex data
  R[i] <- α + (1 - α) Σ_{j ∈ N[i]} W_ji × R[j];
  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}

The Scheduler. The scheduler determines the order in which vertices are updated. [Figure: CPU 1 and CPU 2 pull vertices from the scheduler queue and apply updates on a graph with vertices a to k.] The process repeats until the scheduler is empty.
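How an update function and a scheduler interact can be sketched with a single-threaded loop. This is not GraphLab's API: the FIFO queue, the change threshold, the edge weights of 1/L[j], and all names are assumptions, and the update follows the PageRank example from the update-function slide.

```python
# Sketch: a FIFO scheduler driving a PageRank-style update function.
from collections import deque

def build_in_links(out_links):
    in_links = {v: [] for v in out_links}
    for j, targets in out_links.items():
        for i in targets:
            in_links[i].append(j)
    return in_links

def run_scheduler(out_links, alpha=0.15, tol=1e-6):
    in_links = build_in_links(out_links)
    rank = {v: 1.0 for v in out_links}
    queue = deque(out_links)       # initially every vertex is scheduled
    queued = set(out_links)
    updates = 0
    while queue:                   # repeat until the scheduler is empty
        i = queue.popleft()
        queued.discard(i)
        # Update function: read the scope of i, write the new vertex value.
        new = alpha + (1 - alpha) * sum(rank[j] / len(out_links[j]) for j in in_links[i])
        changed = abs(new - rank[i]) > tol
        rank[i] = new
        updates += 1
        # Reschedule the vertices that read R[i], but only if it changed enough.
        if changed:
            for k in out_links[i]:
                if k not in queued:
                    queue.append(k)
                    queued.add(k)
    return rank, updates

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank, updates = run_scheduler(graph)
```

Unlike the barrier-based model, each update sees the most recent values of its neighbors, and work is spent only on vertices whose inputs actually changed.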

Consistency Through Scheduling. Edge consistency model: two vertices can be updated simultaneously if they do not share an edge. Graph coloring: two vertices can be assigned the same color if they do not share an edge. Each color forms one phase (Phase 1, Phase 2, Phase 3): vertices within a phase execute in parallel, synchronously, with a barrier between phases.
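The link between coloring and safe parallel phases can be sketched with a greedy coloring. Greedy coloring is only one possible heuristic and is an assumption here, as are the toy graph and the function names.

```python
# Sketch: derive parallel execution phases from a greedy graph coloring.

def greedy_coloring(neighbors):
    """Give each vertex the smallest color not used by an already-colored neighbor."""
    color = {}
    for v in neighbors:
        used = {color[n] for n in neighbors[v] if n in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def phases_from_coloring(color):
    """Group vertices by color; each group can be updated in parallel."""
    groups = {}
    for v, c in color.items():
        groups.setdefault(c, []).append(v)
    return [groups[c] for c in sorted(groups)]

# Hypothetical graph: a triangle a-b-c with an extra vertex d attached to c.
neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
color = greedy_coloring(neighbors)
phases = phases_from_coloring(color)
```

No two vertices in the same phase share an edge, so their update functions never touch the same edge data at the same time. A barrier between phases preserves that guarantee.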

Dynamically Changing Graphs. Example: social networks. New users mean new vertices; new friends mean new edges. How do you adaptively maintain computation? Trigger computation with changes in the graph; update interest estimates only where needed; exploit asynchrony; preserve consistency.

Graph Partitioning. How can you quickly place a large data graph in a distributed environment? Edge separators fail on large power-law graphs (social networks, recommender systems, NLP). Constructing vertex separators at scale: no large-scale tools! How can you adapt the placement in changing graphs?

Is a new computation model needed? BSP: safe but slow. Async: fast but risky.