Machine Learning over Big Data
Presented by Fuhao Zou (fuhao@hust.edu.cn)
June 16, 2014
Huazhong University of Science and Technology
Contents
1. Role of Machine Learning
2. Challenge of Big Data Analysis
3. Distributed Machine Learning
4. Conclusion
Big Data Already Happened
- 1 billion tweets per week
- 750 million Facebook users
- 6 billion Flickr photos
- 48 hours of video uploaded every minute
The 4 Vs of Big Data
1. Volume
2. Variety
3. Velocity
4. Value
Knowledge
"If a tree falls in a forest and no one is around to hear it, does it make a sound?" --- George Berkeley
Nobody knows what is in all these data unless they are processed and analyzed.
Why Do Machine Learning on Big Data?
Machine learning incorporates data analyses covering predictive analytics, data mining, pattern recognition, and multivariate statistics.
Basic analytics: counts and averages, preprocessing, SQL queries; human scale, hand-crafted.
Advanced analytics (machine learning): discovering patterns, making predictions, multivariate queries; machine scale, data-driven.
Unfortunately, traditional analytics tools are not well suited to capturing the value hidden in Big Data.
2. Challenge of Big Data Analysis
Challenge #1: Massive Data Scale
~1B nodes: the data do not fit into the main memory of a single machine. A familiar problem!
Challenge #2: Gigantic Model Size
> 10^11 parameters, which do not fit into the main memory of a single machine either!
Challenge #3: Huge Cognitive Space
1M to 1B categories are seen in modern extreme classification problems. Can kNN or SVM cope at that scale?
Distributed ML Is a Necessity
- Large-scale application: problem definition, feature engineering, parameter tuning
- Learning methods & theory: learning methods, efficient algorithms, learning theory
- Distributed machine learning: programming models, parallel computing
- Distributed storage and computing: distributed systems, resource management
3. Distributed Machine Learning
Categories of Distributed ML
1. Data-Parallel ML
2. Graph-Parallel ML
Data-Parallel ML
Linear regression with gradient descent:
Repeat {
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$   (for every $j$)
}
Map-Reduce and Data Parallelism
Batch gradient descent with m = 400 training examples, split across 4 machines:
Machine 1 uses examples 1-100: $temp_j^{(1)} = \sum_{i=1}^{100} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Machine 2 uses examples 101-200: $temp_j^{(2)} = \sum_{i=101}^{200} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Machines 3 and 4 use examples 201-300 and 301-400, analogously.
Combine: $\theta_j := \theta_j - \alpha \frac{1}{400} \left(temp_j^{(1)} + temp_j^{(2)} + temp_j^{(3)} + temp_j^{(4)}\right)$
Map-Reduce
[Figure: the training set is split across Computers 1-4; each computes a partial result, and the results are combined.]
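Below is a minimal Python sketch (not from the original deck) of the data-parallel recipe above: each simulated machine computes the partial sum $temp_j$ over its own shard of the 400 examples, and the combine step applies the batch update. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def partial_gradient(theta, X_shard, y_shard):
    # "Map": temp_j = sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i) over this shard
    errors = X_shard @ theta - y_shard
    return X_shard.T @ errors

def distributed_step(theta, shards, alpha):
    # "Combine": theta_j := theta_j - alpha * (1/m) * (sum of shard temps)
    m = sum(len(y) for _, y in shards)
    temps = [partial_gradient(theta, X, y) for X, y in shards]
    return theta - (alpha / m) * sum(temps)

# Usage: 400 synthetic examples split into 4 shards of 100, as on the slide.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=400)
shards = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 400, 100)]
theta = np.zeros(3)
for _ in range(200):
    theta = distributed_step(theta, shards, alpha=0.1)
print(theta)  # approaches the true weights [1.0, -2.0, 0.5]
```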
Graph-Parallel ML
Social Arithmetic: Label Propagation
Me: 50% what I list on my profile + 40% what Sue Ann likes + 10% what Carlos likes, giving: I like 60% cameras, 40% biking.
(My profile: 50% cameras, 50% biking. Sue Ann: 80% cameras, 20% biking. Carlos: 30% cameras, 70% biking.)
Recurrence algorithm: $Likes[i] = \sum_{j \in Friends[i]} W_{ij} \times Likes[j]$, iterated until convergence.
Parallelism: compute all $Likes[i]$ in parallel.
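A minimal Python sketch of the recurrence above, assuming a toy three-person network with the slide's weights; the names and data layout are illustrative, not from GraphLab.

```python
# likes[i] maps interest -> weight; friends[i] lists i's neighbors
# (including i itself, to model "what I list on my profile").
def propagate(friends, W, likes, iters):
    for _ in range(iters):
        new = {}
        for i in friends:  # every Likes[i] can be computed in parallel
            acc = {}
            for j in friends[i]:
                for interest, p in likes[j].items():
                    acc[interest] = acc.get(interest, 0.0) + W[i, j] * p
            new[i] = acc
        likes = new
    return likes

friends = {"me": ["me", "sue_ann", "carlos"],
           "sue_ann": ["sue_ann"], "carlos": ["carlos"]}
W = {("me", "me"): 0.5, ("me", "sue_ann"): 0.4, ("me", "carlos"): 0.1,
     ("sue_ann", "sue_ann"): 1.0, ("carlos", "carlos"): 1.0}
likes = {"me": {"cameras": 0.5, "biking": 0.5},
         "sue_ann": {"cameras": 0.8, "biking": 0.2},
         "carlos": {"cameras": 0.3, "biking": 0.7}}
print(propagate(friends, W, likes, iters=1)["me"])
# one step reproduces the slide: {'cameras': 0.6, 'biking': 0.4}
```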
PageRank (Centrality Measures)
Iterate: $R[i] = \alpha + (1 - \alpha) \sum_{j \in \text{InNbrs}(i)} \frac{R[j]}{L[j]}$
Where: $\alpha$ is the random reset probability and $L[j]$ is the number of links on page $j$.
[Figure: a small six-page link graph.]
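A small Python sketch of this iteration, assuming a toy six-page link graph (the actual graph in the slide's figure is not recoverable):

```python
def pagerank(out_links, alpha=0.15, iters=50):
    pages = list(out_links)
    in_nbrs = {i: [j for j in pages if i in out_links[j]] for i in pages}
    R = {i: 1.0 for i in pages}
    for _ in range(iters):
        # R[i] = alpha + (1 - alpha) * sum over in-neighbors j of R[j] / L[j]
        R = {i: alpha + (1 - alpha) * sum(R[j] / len(out_links[j])
                                          for j in in_nbrs[i])
             for i in pages}
    return R

# Toy 6-page graph; every page links somewhere, so L[j] > 0.
links = {1: [2], 2: [3], 3: [1, 4], 4: [5, 6], 5: [6], 6: [4]}
print(pagerank(links))
```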
Matrix Factorization: Alternating Least Squares (ALS)
Netflix ratings form a sparse Users x Movies matrix (observed entries such as $r_{11}, r_{12}, r_{23}, r_{24}$), factored into user factors $U$ (vectors $u_i$) and movie factors $M$ (vectors $m_j$).
The update function alternates: holding $M$ fixed, solve a least-squares problem for each $u_i$; then holding $U$ fixed, solve one for each $m_j$.
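A hedged numpy sketch of ALS under these assumptions; the rank k, regularization lam, and the toy ratings matrix are all illustrative choices, not the Netflix setup.

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20):
    n_users, n_movies = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(n_users, k))   # user factors u_i
    M = rng.normal(size=(n_movies, k))  # movie factors m_j
    for _ in range(iters):
        for i in range(n_users):        # each u_i is independent (parallelizable)
            J = np.flatnonzero(mask[i])  # movies rated by user i
            if J.size:
                U[i] = np.linalg.solve(M[J].T @ M[J] + lam * np.eye(k),
                                       M[J].T @ R[i, J])
        for j in range(n_movies):       # then each m_j, holding U fixed
            I = np.flatnonzero(mask[:, j])
            if I.size:
                M[j] = np.linalg.solve(U[I].T @ U[I] + lam * np.eye(k),
                                       U[I].T @ R[I, j])
    return U, M

# Toy matrix with observed entries r11, r12, r23, r24, as in the slide's figure.
R = np.array([[5.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 2.0]])
U, M = als(R, mask=R > 0)
print(U @ M.T)  # reconstruction approximates the observed entries
```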
Other Examples
- Statistical inference in relational models: belief propagation, Gibbs sampling
- Network analysis: centrality measures, triangle counting
- Natural language processing: CoEM, topic modeling
Bulk Synchronous Parallel
[Figure: over successive iterations, CPUs 1-3 compute in parallel, then all wait at a barrier before the next iteration begins.]
Pregel: Bulk Synchronous Parallel
Each superstep: compute, communicate, barrier.
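To make the superstep structure concrete, here is an illustrative Python sketch of the BSP pattern using a thread barrier; it is a toy model of the compute/communicate/barrier cycle, not how Pregel is implemented.

```python
import threading

N_WORKERS, SUPERSTEPS = 3, 4
barrier = threading.Barrier(N_WORKERS)
state = [0] * N_WORKERS

def worker(wid):
    for step in range(SUPERSTEPS):
        state[wid] += 1   # "compute" on the local partition
        # "communicate": messages to other workers would be exchanged here
        barrier.wait()    # no worker enters superstep s+1 until all finish s

threads = [threading.Thread(target=worker, args=(w,)) for w in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state)  # all workers completed every superstep in lockstep: [4, 4, 4]
```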
Open-Source Implementations
- Pregel: https://plus.google.com/+pregel/
- Giraph: http://incubator.apache.org/giraph/
- Golden Orb: http://goldenorbos.org/
Curse of the Slow Job
[Figure: the same BSP iteration diagram; at every barrier, all CPUs must wait for the slowest job to finish.]
Tradeoffs of the BSP Model
Pros:
- Graph-parallel
- Relatively easy to build
- Deterministic execution
Cons:
- Doesn't exploit the graph structure
- Can lead to inefficient systems
Asynchronous Computation Model
Threads proceed to the next iteration without waiting.
An asynchronous variant: GraphLab (http://graphlab.org/).
What is GraphLab?
The GraphLab Framework
- Graph-based data representation
- Update functions (user computation)
- Scheduler
- Consistency model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
Example: Graph: social network. Vertex data: user profile text, current interest estimates. Edge data: similarity weights.
New Perspective on Partitioning
- Natural graphs have poor edge separators: classic graph partitioning tools (e.g., ParMetis, Zoltan) fail, and cut edges force many edges to be synchronized between CPU 1 and CPU 2.
- Natural graphs have good vertex separators: cutting on a vertex means only a single vertex must be synchronized.
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex:

pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ji, R[j]) from scope
  // Update the vertex data
  R[i] = α + (1 − α) * Σ_{j ∈ N[i]} W_ji * R[j];
  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}
The Scheduler
The scheduler determines the order in which vertices are updated: CPUs repeatedly pull the next scheduled vertex and apply its update function, which may schedule further vertices. The process repeats until the scheduler is empty.
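A single-threaded Python sketch of this scheduler loop, paired with a PageRank-style update that reschedules neighbors when its value changes; all names are illustrative assumptions, and real GraphLab drives this loop with many parallel workers.

```python
from collections import deque

def run(graph, data, update, initial):
    scheduler, pending = deque(initial), set(initial)
    while scheduler:                    # repeat until the scheduler is empty
        v = scheduler.popleft()
        pending.discard(v)
        if update(v, graph, data):      # update may request rescheduling
            for u in graph[v]:          # reschedule neighbors if needed
                if u not in pending:
                    scheduler.append(u)
                    pending.add(u)

def pagerank_update(v, graph, R, alpha=0.15, tol=1e-3):
    # R[v] = alpha + (1 - alpha) * sum_j W_jv * R[j], with W_jv = 1/deg(j)
    new = alpha + (1 - alpha) * sum(R[j] / len(graph[j]) for j in graph[v])
    changed = abs(new - R[v]) > tol
    R[v] = new
    return changed

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}  # undirected triangle
R = {v: 0.0 for v in graph}
run(graph, R, pagerank_update, list(graph))
print(R)  # each rank converges toward 1.0, then the scheduler drains
```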
Consistency Through Scheduling
Edge consistency model: two vertices can be updated simultaneously if they do not share an edge.
Graph coloring: two vertices can be assigned the same color if they do not share an edge.
Execution: vertices of one color form a phase; each phase executes in parallel synchronously, with a barrier between phases.
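An illustrative sketch of this phased execution: greedily color the graph (greedy coloring is an assumption here, not necessarily GraphLab's method), then update one color class at a time; within a phase no two vertices share an edge, so their updates cannot conflict.

```python
def greedy_coloring(graph):
    # Two vertices sharing an edge never receive the same color.
    color = {}
    for v in graph:
        taken = {color[u] for u in graph[v] if u in color}
        color[v] = next(c for c in range(len(graph)) if c not in taken)
    return color

def run_by_color(graph, data, update):
    color = greedy_coloring(graph)
    for phase in sorted(set(color.values())):
        batch = [v for v in graph if color[v] == phase]
        for v in batch:              # no two vertices in a batch share an
            update(v, graph, data)   # edge, so the batch could run in parallel
        # (a barrier between phases would go here in a parallel runtime)

def bump(v, graph, data):
    # Trivial update for demonstration: count visits per vertex.
    data[v] = data.get(v, 0) + 1

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
counts = {}
run_by_color(graph, counts, bump)
print(counts)  # {'a': 1, 'b': 1, 'c': 1}: a triangle needs 3 conflict-free phases
```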
4. Conclusion
Dynamically Changing Graphs
Example: social networks. New users mean new vertices; new friends mean new edges.
How do you adaptively maintain computation?
- Trigger computation with changes in the graph
- Update interest estimates only where needed
- Exploit asynchrony
- Preserve consistency
Graph Partitioning
How can you quickly place a large data graph in a distributed environment?
- Edge separators fail on large power-law graphs (social networks, recommender systems, NLP).
- Constructing vertex separators at scale: no large-scale tools yet!
- How can you adapt the placement as the graph changes?
Is a New Computation Model Needed?
BSP: safe but slow. Async: fast but risky.