
MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing
Iftekhar Naim

Outline. Motivation. MapReduce: overview, design issues and abstractions, examples and results, pros and cons. GraphLab: the graph-parallel abstraction, MapReduce vs. Pregel vs. GraphLab, implementation issues, pros and cons.

Motivation: the age of Big Data. Roughly 45 billion webpages, about 1.06 billion Facebook users, around 24 million Wikipedia pages, and some 6 billion Flickr photos. Data at this scale is infeasible to analyze on a single machine; the solution is to distribute the work across many computers. The challenge is that we repeatedly solve the same system-level problems: communication and coordination among the nodes, synchronization and race conditions, load balancing, locality, fault tolerance, and so on. We need a higher-level programming model that hides these messy details and applies to a large number of problems.

MapReduce: Overview. A programming model for large-scale data-parallel applications, introduced by Google (Dean & Ghemawat, OSDI 2004). Petabytes of data are processed on clusters with thousands of commodity machines. It is suitable for programs that can be decomposed into many embarrassingly parallel tasks, and it hides the low-level parallel programming details. Programmers have to specify two primary methods:

Map: processes input data and generates intermediate key/value pairs: map(k, v) -> (k', v')*
Reduce: aggregates and merges all intermediate values associated with the same key: reduce(k', <v'>*) -> (k', v'')

Example: count word frequencies from web pages. Input: web documents, with key = web document URL and value = document content. Output: frequencies of individual words.

Step 1: specify a Map function. Input: <key = url, value = content>, e.g. <www.abc.com, "abc ab cc ab">. Output: <key = word, value = partialCount>*, e.g. <abc, 1>, <ab, 1>, <cc, 1>, <ab, 1>.

Step 2: specify a Reduce function, which collects the partial counts provided by the map function and computes the total count for each individual word. Input: intermediate files <key = word, value = partialCount*>, e.g. key = abc with values {1}, key = ab with values {1, 1}, key = cc with values {1}. Output: <key = word, value = totalCount>*, e.g. <abc, 1>, <ab, 2>, <cc, 1>.
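To make the data flow concrete, here is a minimal, self-contained C++ sketch (not the Google or Hadoop API; the document contents and URLs are made up for illustration) that simulates the three stages of the word-count example on a single machine: map, shuffle/group-by-key, and reduce.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Input: (url, content) pairs.
    std::vector<std::pair<std::string, std::string>> docs = {
        {"www.abc.com", "abc ab cc ab"},
        {"www.xyz.com", "ab abc"}
    };

    // Map phase: emit (word, 1) for every word in every document.
    std::vector<std::pair<std::string, int>> intermediate;
    for (const auto& doc : docs) {
        std::istringstream words(doc.second);
        std::string w;
        while (words >> w) intermediate.emplace_back(w, 1);
    }

    // Shuffle phase: group intermediate values by key
    // (the MapReduce runtime does this between map and reduce).
    std::map<std::string, std::vector<int>> grouped;
    for (const auto& kv : intermediate) grouped[kv.first].push_back(kv.second);

    // Reduce phase: sum the partial counts for each word.
    for (const auto& kv : grouped) {
        int total = 0;
        for (int c : kv.second) total += c;
        std::cout << kv.first << " " << total << "\n";
    }
    return 0;
}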

Example code: word count (pseudocode from the paper).

void map(String key, String value):
  // key: webpage url
  // value: webpage contents
  for each word w in value:
    EmitIntermediate(w, "1");

void reduce(String key, Iterator partialCounts):
  // key: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

How MapReduce works: the input is divided into M splits of roughly 16-64 MB each, and a special copy of the program acts as the master while the rest act as workers. Ref: J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI, 2004.

Implementation details: scheduling. There is one master and many workers. The MapReduce library splits the input data into M pieces (16 MB or 64 MB per piece, stored in GFS). The master assigns each idle worker a map or reduce task. A map worker completes its task, buffers the intermediate (key, value) pairs in memory, and periodically writes them to local disk; the locations of the buffered pairs are returned to the master. The master passes the locations of completed map outputs to the reduce workers, which read the intermediate files using RPC, sort the keys, and perform the reduction.

Fault tolerance. Worker failure: the master pings the workers periodically; if there is no response, the master marks the worker as failed. Any map or reduce task in progress on that worker is marked for rescheduling. Completed map tasks on the failed worker are also re-executed, since their output lives on that worker's local disk, whereas completed reduce tasks do not have to be recomputed (their output is already in the global file system). Master failure: the master can write periodic checkpoints to GFS, so that on failure a new master recovers from the latest checkpoint and continues. In practice this case is often not handled, and the job simply aborts if the master fails (failure of the single master is much less probable).
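As a rough illustration of the worker-failure handling described above (not the actual Google implementation; the timeout value, task names, and data structures are assumptions for the example), the following self-contained C++ sketch shows a master tracking each worker's last heartbeat and collecting the tasks of unresponsive workers for rescheduling.

#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Worker {
    Clock::time_point last_heartbeat;
    std::vector<std::string> in_progress_tasks;  // tasks currently assigned
    bool failed;
};

// Mark workers that have not responded within `timeout` as failed and
// collect their in-progress tasks so the master can reassign them.
std::vector<std::string> collect_tasks_to_reschedule(
        std::unordered_map<int, Worker>& workers,
        std::chrono::seconds timeout) {
    std::vector<std::string> to_reschedule;
    const auto now = Clock::now();
    for (auto& entry : workers) {
        Worker& w = entry.second;
        if (!w.failed && now - w.last_heartbeat > timeout) {
            w.failed = true;
            std::cout << "worker " << entry.first << " marked as failed\n";
            to_reschedule.insert(to_reschedule.end(),
                                 w.in_progress_tasks.begin(),
                                 w.in_progress_tasks.end());
            w.in_progress_tasks.clear();
        }
    }
    return to_reschedule;
}

int main() {
    std::unordered_map<int, Worker> workers;
    workers[1] = Worker{Clock::now(), {"map-0", "map-1"}, false};
    workers[2] = Worker{Clock::now() - std::chrono::seconds(60), {"reduce-3"}, false};

    auto tasks = collect_tasks_to_reschedule(workers, std::chrono::seconds(30));
    for (const auto& t : tasks) std::cout << "reschedule " << t << "\n";
    return 0;
}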

Locality. Network bandwidth is an important resource, and communicating large datasets to worker nodes can be very expensive. Rather than transferring data to the worker, assign the task to a worker that already has the data locally. GFS keeps multiple replicas of each piece of data (typically 3), and the master assigns the task to one of the machines holding the data in its local file system.

Other refinements: task granularity. There are M map tasks and R reduce tasks. We want M and R to be large, which gives better dynamic load balancing and faster recovery from failure; but the number of scheduling decisions the master must make grows with M and R. We also end up with R output files, so R should not be too large either. Typical settings for 2,000 worker machines: M = 200,000 and R = 5,000.

Other refinements: backup tasks. Some machines can be extremely slow ("stragglers"), perhaps because of a bad disk that frequently encounters correctable errors. The solution is backup tasks: near the end of the job, the master schedules a backup execution for each of the remaining in-progress tasks, and a task is marked as complete as soon as either the primary or the backup execution completes.

Results: sorting, with M = 15,000, R = 4,000, on ~1,800 machines.
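A minimal C++ sketch of the backup-task idea (speculative execution) described above; the task names and the "near the end of the job" threshold are assumptions for illustration, not parameters of the original system.

#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Job {
    std::vector<std::string> all_tasks;
    std::set<std::string> completed;        // tasks that have finished
    std::set<std::string> backups_launched; // tasks with a backup copy running
};

// Near the end of the job, launch a duplicate ("backup") execution of every
// task that is still in progress; whichever copy finishes first wins.
void maybe_launch_backups(Job& job, double near_end_fraction = 0.9) {
    double done = static_cast<double>(job.completed.size()) / job.all_tasks.size();
    if (done < near_end_fraction) return;
    for (const auto& t : job.all_tasks) {
        if (!job.completed.count(t) && !job.backups_launched.count(t)) {
            job.backups_launched.insert(t);
            std::cout << "launching backup copy of " << t << "\n";
        }
    }
}

// Called when either the primary or the backup copy of a task finishes.
void mark_complete(Job& job, const std::string& task) {
    job.completed.insert(task);
}

int main() {
    Job job{{"map-0", "map-1", "map-2", "map-3", "map-4",
             "map-5", "map-6", "map-7", "map-8", "map-9"}, {}, {}};
    for (int i = 0; i < 9; ++i) mark_complete(job, "map-" + std::to_string(i));
    maybe_launch_backups(job);  // only "map-9" is still running, so back it up
    return 0;
}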

MapReduce programs at Google: the word frequency example in Google's C++ library.

Word Frequency: MAP

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    // Perform the map operation: parse the input and,
    // for each word w in it,
    //   Emit(w, "1");
  }
};
REGISTER_MAPPER(WordCounter);

Word Frequency: REDUCE

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

Word Frequency: MAIN (simplified)

int main(int argc, char** argv) {
  MapReduceSpecification spec;
  // Store the list of input files in "spec"
  MapReduceInput* input = spec.add_input();
  input->set_mapper_class("WordCounter");
  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_reducer_class("Adder");
  // Tuning parameters: use at most 2000 machines
  spec.set_machines(2000);
  // Now run it
  MapReduceResult result;
  MapReduce(spec, &result);
  return 0;
}

MapReduce instances at Google: the number of MapReduce programs grew exponentially from January 2003 through January 2007. Ref: PACT '06 keynote slides by Jeff Dean, Google, Inc.

Pros of MapReduce. Simplicity of the model: the programmer specifies a few simple methods and focuses on the functionality, not on the parallelism, and the code is generic and portable across systems. Scalability: it scales easily to large clusters with thousands of machines. Applicability to many different systems and a wide variety of problems: distributed grep, sorting, inverted indexes, word frequency counting, and so on.

Cons of MapReduce. Restricted programming constructs (only map and reduce). It does not scale well for dependent tasks (for example, graph problems), and it does not scale well for iterative algorithms, which are very common in machine learning, because each iteration must be expressed as a separate MapReduce job. (See the sketch below.)

Summary: MapReduce is a restricted but elegant solution offering a high-level abstraction. Google's implementation is in C++, and there are many open-source implementations: Hadoop (distributed, Java), and Phoenix and Metis (shared memory, for multicore, C++).
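To illustrate the iterative-algorithm limitation mentioned above, here is a schematic C++ sketch (not a real MapReduce or Hadoop API; the run_mapreduce_job function and the file paths are hypothetical) of a driver that must launch a full job, and write and re-read all state through the distributed file system, on every iteration.

#include <iostream>
#include <string>

// Hypothetical stand-in for submitting one complete MapReduce job that reads
// its input from and writes its output to a distributed file system.
void run_mapreduce_job(const std::string& input_path,
                       const std::string& output_path) {
    std::cout << "job: " << input_path << " -> " << output_path << "\n";
}

int main() {
    // An iterative algorithm (e.g., PageRank or k-means) expressed on top of
    // plain MapReduce: every iteration is a separate job, so the entire data
    // set is re-read from and re-written to disk each time, and no state is
    // kept in memory between iterations.
    std::string current = "/data/ranks_iter_0";
    const int num_iterations = 10;
    for (int i = 1; i <= num_iterations; ++i) {
        std::string next = "/data/ranks_iter_" + std::to_string(i);
        run_mapreduce_job(current, next);
        current = next;
    }
    return 0;
}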

GraphLab

GraphLab: Motivation. MapReduce is great for data-parallel applications that can easily be decomposed using map and reduce, but it assumes independence among the tasks, and that independence assumption can be too restrictive. Many interesting problems involve graphs: we need to model dependence and interactions among entities in order to extract more signal from noisy data, and MapReduce is not well suited for these problems. Machine learning practitioners typically had two choices: simple algorithms on big data, or powerful algorithms (with dependencies) on small data.

Graph-based abstraction: powerful algorithms on big data. Graph-parallel programming models include Pregel, a Bulk Synchronous Parallel model (Google, SIGMOD 2010), and GraphLab, an asynchronous model [UAI 2010, PVLDB 2012].

Bulk Synchronous Parallel model: Pregel [Malewicz et al. 2010]. Computation proceeds in supersteps: compute, communicate, then wait at a barrier (a BSP superstep loop is sketched below). Tradeoffs of the BSP model. Pros: it scales better than MapReduce for graphs, is relatively easy to build, and gives deterministic execution. Cons: it is inefficient if different regions of the graph converge at different speeds, it can suffer if one task is more expensive than the others, and the runtime of each phase is determined by the slowest machine. (Figure: http://graphlab.org/powergraph-presented-at-osdi/)
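A minimal, single-process C++ sketch of the BSP compute/communicate/barrier pattern described above (a toy illustration, not the Pregel API): each vertex runs a compute step on the messages it received in the previous superstep, outgoing messages are delivered only at the superstep boundary, and the barrier is implicit in the loop structure.

#include <iostream>
#include <vector>

int main() {
    // A tiny directed graph: out_edges[v] lists the targets of v's edges.
    std::vector<std::vector<int>> out_edges = {{1, 2}, {2}, {0}};
    const int num_vertices = static_cast<int>(out_edges.size());

    // Toy vertex state: each vertex accumulates the sum of messages it receives.
    std::vector<double> value(num_vertices, 1.0);
    std::vector<double> inbox(num_vertices, 0.0);

    const int num_supersteps = 3;
    for (int step = 0; step < num_supersteps; ++step) {
        std::vector<double> next_inbox(num_vertices, 0.0);

        // Compute + communicate: every vertex processes its inbox and sends
        // messages along its out-edges; messages are buffered for the NEXT
        // superstep, which is exactly what the barrier enforces.
        for (int v = 0; v < num_vertices; ++v) {
            value[v] += inbox[v];
            for (int target : out_edges[v]) {
                next_inbox[target] += value[v] / out_edges[v].size();
            }
        }

        // Barrier: all vertices finish before the new messages become visible.
        inbox = std::move(next_inbox);
    }

    for (int v = 0; v < num_vertices; ++v)
        std::cout << "vertex " << v << ": " << value[v] << "\n";
    return 0;
}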

Synchronous vs. asynchronous belief propagation [Gonzalez, Low, Guestrin '09]. (Figure: runtime in seconds vs. number of CPUs, 1-8, comparing bulk synchronous execution (e.g., Pregel-style) with asynchronous Splash BP.) GraphLab vs. Pregel on PageRank [Low et al., PVLDB '12]. (Figure: PageRank on 25M vertices and 355M edges with 16 processors.)

GraphLab framework: overview. GraphLab allows asynchronous iterative computation. The GraphLab abstraction consists of three key elements: the data graph, update functions, and the sync operation. We explain GraphLab using the PageRank example.

Case study: PageRank. Iterate

    R[i] = α + (1 − α) · Σ_{j ∈ InNbrs(i)} R[j] / L[j]

where α is the random reset probability and L[j] is the number of links on page j. (A small worked example follows. Figure: http://graphlab.org/powergraph-presented-at-osdi/)
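A quick worked instance of the iteration above, under assumed values chosen only for illustration: α = 0.15, all ranks initialized to 1.0, and a hypothetical page 3 linked to by page 1 (which has L[1] = 2 outgoing links) and by page 2 (which has L[2] = 1):

    R[3] = 0.15 + 0.85 · (R[1]/L[1] + R[2]/L[2])
         = 0.15 + 0.85 · (1.0/2 + 1.0/1)
         = 0.15 + 1.275
         = 1.425

Repeating such updates over all vertices until every rank stops changing by more than some small ε is the fixed-point computation that the GraphLab update function below implements.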

1. Data Graph. The data graph is a directed graph G = (V, E, D), where the data D refers to model parameters, algorithm state, and other related data. Example, a social network: the graph is the social network, the vertex data holds the user profile text and the current interest estimates, and the edge data holds similarity weights. For PageRank, G = (V, E, D): each vertex v corresponds to a webpage, each edge (u, v) corresponds to a link u -> v, the vertex data D_v stores the rank R[v] of the webpage, and the edge data D_uv stores the weight W_uv of the link u -> v, used in the update

    R[v] = α + (1 − α) · Σ_{u ∈ N[v]} W_uv · R[u]

(Figure: http://graphlab.org/powergraph-presented-at-osdi/)

2. Update Functions. An update function is a user-defined program (similar to map). Applied to a vertex, it transforms the data in the scope of that vertex. For PageRank:

pagerank(v, scope) {
  // Get neighborhood data (R[v], W_uv, R[u]) from the scope
  // Get old page rank
  R_old[v] = R[v]
  // Update the vertex data
  R[v] = α + (1 − α) · Σ_{u ∈ N[v]} W_uv · R[u]
  // Reschedule neighbors if needed
  if |R_old[v] − R[v]| > ε
    reschedule_neighbors_of(v)
}

The corresponding GraphLab update functor in C++:

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    // Weighted sum of the ranks of in-neighbors, with W_uv = 1/out_degree(u)
    double sum = 0;
    foreach (edge_type edge, context.in_edges())
      sum += 1.0 / context.num_out_edges(edge.source())
             * context.vertex_data(edge.source());
    double& rank = context.vertex_data();
    double old_rank = rank;
    rank = RESET_PROB + (1 - RESET_PROB) * sum;
    // Reschedule out-neighbors if the rank changed significantly
    double residual = abs(rank - old_rank);
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};

3. Sync Operation. A global operation, usually performed periodically in the background, useful for maintaining global statistics of the algorithm. Examples: PageRank may want to return a list of the 100 top-ranked web pages, or to evaluate a global convergence criterion; Expectation Maximization may want to estimate the total log-likelihood. It is similar to the Reduce functionality in MapReduce [Low et al., PVLDB '12]. (A schematic sketch of this kind of global aggregation appears after the Hello World example below.)

GraphLab: Hello World!

#include <graphlab.hpp>

int main(int argc, char** argv) {
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;
  dc.cout() << "Hello World!\n";
  graphlab::mpi_tools::finalize();
}

GraphLab tutorials: for more interesting examples, see http://docs.graphlab.org/using_graphlab.html, which includes a step-by-step tutorial for implementing PageRank in GraphLab. Name the file my_first_app.cpp, build it with make, and execute it with: mpiexec -n 4 ./my_first_app. Note that dc.cout() provides a wrapper around the standard std::cout; when used in a distributed environment, only one copy will print, even though all machines execute it.
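A schematic, single-machine C++ sketch of the sync idea above (this is not the GraphLab sync API; the vertex-data layout and page names are assumptions made for illustration): periodically fold over all vertex data to compute global statistics, here the total rank and the current top-ranked pages.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical vertex data: (page name, current rank estimate).
using VertexData = std::pair<std::string, double>;

// A sync-style global aggregation: reduce over all vertex data in the
// background of the asynchronous computation and report global statistics.
void sync_top_pages(const std::vector<VertexData>& vertices, std::size_t k) {
    double total_rank = 0.0;
    for (const auto& v : vertices) total_rank += v.second;

    std::vector<VertexData> sorted = vertices;
    std::sort(sorted.begin(), sorted.end(),
              [](const VertexData& a, const VertexData& b) {
                  return a.second > b.second;
              });

    std::cout << "total rank = " << total_rank << ", top pages:";
    for (std::size_t i = 0; i < k && i < sorted.size(); ++i)
        std::cout << " " << sorted[i].first;
    std::cout << "\n";
}

int main() {
    std::vector<VertexData> vertices = {
        {"a.com", 1.4}, {"b.com", 0.6}, {"c.com", 2.1}};
    sync_top_pages(vertices, 2);  // would be invoked periodically by the runtime
    return 0;
}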

A few additional details. Fault tolerance is provided using checkpointing. The GraphLab described here is GraphLab 2.1, the version for distributed systems; the first version was proposed only for multicore systems, and more recently PowerGraph was proposed (GraphLab on the cloud).

Tradeoffs of GraphLab. Pros: it allows dynamic asynchronous scheduling, has a more expressive consistency model, and gives faster and more efficient runtime performance. Cons: non-deterministic execution, and it is substantially more complicated to implement.

Summary. MapReduce: efficient for independent tasks and a simple framework, but the independence assumption can be too restrictive; it is not scalable for graphs, dependent tasks, or iterative tasks. Pregel (Bulk Synchronous Parallel model): can model dependent and iterative tasks, is easy to build and deterministic, but suffers from inefficiencies due to synchronous scheduling. GraphLab (asynchronous model): can model dependent and iterative tasks and is fast, efficient, and expressive, but introduces non-determinism and a relatively complex implementation.

References
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean & Sanjay Ghemawat, OSDI 2004.
GraphLab: A New Framework for Parallel Machine Learning, Low et al., UAI 2010.
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, Low et al., PVLDB 2012.
Pregel: A System for Large-Scale Graph Processing, Malewicz et al., SIGMOD 2010.
Parallel Splash Belief Propagation, Gonzalez et al., Journal of Machine Learning Research, 2010.

Disclaimer: many figures and illustrations were collected from Carlos Guestrin's GraphLab tutorials (http://docs.graphlab.org/using_graphlab.html).

Additional GraphLab implementation issues: the scheduler and the consistency model.

Scheduling. The scheduler determines the order in which vertices are updated. (Figure: CPU 1 and CPU 2 repeatedly pull vertices a, b, c, ... from a shared scheduler queue.) Each CPU takes the next scheduled vertex, applies the update function, and possibly schedules further vertices; the process repeats until the scheduler is empty. (A minimal scheduler sketch appears below.)

What happens for boundary vertices (across machines)? Data is classified as edge data and vertex data, and a huge graph is partitioned across multiple machines; ghost vertices (along the cut) maintain adjacency information. The graph must be cut efficiently, using parallel graph partitioning tools such as ParMetis.
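A minimal single-threaded C++ sketch of the scheduling loop described above (assumptions for illustration: a FIFO scheduler and a trivial averaging update function; the real GraphLab offers several scheduler types and runs this loop on many CPUs under its consistency models).

#include <cmath>
#include <deque>
#include <iostream>
#include <unordered_set>
#include <vector>

struct Scheduler {
    std::deque<int> queue;
    std::unordered_set<int> pending;  // avoid scheduling the same vertex twice

    void schedule(int v) {
        if (pending.insert(v).second) queue.push_back(v);
    }
    bool empty() const { return queue.empty(); }
    int next() {
        int v = queue.front();
        queue.pop_front();
        pending.erase(v);
        return v;
    }
};

int main() {
    // Toy undirected graph: neighbors[v] lists the neighbors of vertex v.
    std::vector<std::vector<int>> neighbors = {{1, 2}, {0, 2}, {0, 1}};
    std::vector<double> value = {1.0, 0.0, 0.0};

    Scheduler sched;
    for (int v = 0; v < 3; ++v) sched.schedule(v);

    // The engine loop: pull a vertex, apply the update function, and let the
    // update reschedule its neighbors; stop when the scheduler is empty.
    while (!sched.empty()) {
        int v = sched.next();
        double old_value = value[v];
        double sum = 0.0;
        for (int u : neighbors[v]) sum += value[u];
        value[v] = 0.5 * old_value + 0.5 * (sum / neighbors[v].size());
        if (std::abs(value[v] - old_value) > 1e-3)
            for (int u : neighbors[v]) sched.schedule(u);
    }

    for (int v = 0; v < 3; ++v) std::cout << "v" << v << " = " << value[v] << "\n";
    return 0;
}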

Consistency model. Race conditions may occur when shared data is updated, that is, when overlapping computations run simultaneously. How to avoid races: ensure that the update functions of two vertices run simultaneously only if their scopes do not overlap. GraphLab provides three consistency models: full consistency, edge consistency, and vertex consistency. Stronger consistency restricts which updates may run at the same time, so there is a tradeoff between consistency and parallelism. (A small sketch of the scope-overlap test is given below.)
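The following self-contained C++ sketch (an illustration, not GraphLab's actual locking code) expresses the scope-overlap rule for the three consistency models: under vertex consistency two distinct vertices may always run together, under edge consistency they must not be adjacent, and under full consistency they must also not share a neighbor.

#include <iostream>
#include <set>
#include <vector>

enum class Consistency { Vertex, Edge, Full };

using Graph = std::vector<std::set<int>>;  // adjacency sets, undirected for simplicity

bool adjacent(const Graph& g, int u, int v) { return g[u].count(v) > 0; }

bool share_neighbor(const Graph& g, int u, int v) {
    for (int w : g[u]) if (g[v].count(w)) return true;
    return false;
}

// Can the update functions of u and v safely run at the same time?
bool can_run_in_parallel(Consistency model, const Graph& g, int u, int v) {
    if (u == v) return false;  // same vertex: never
    switch (model) {
        case Consistency::Vertex: return true;                // only the vertex itself is protected
        case Consistency::Edge:   return !adjacent(g, u, v);  // scopes share an edge if adjacent
        case Consistency::Full:   return !adjacent(g, u, v) && !share_neighbor(g, u, v);
    }
    return false;
}

int main() {
    // Path graph 0 - 1 - 2 - 3.
    Graph g = {{1}, {0, 2}, {1, 3}, {2}};
    std::cout << can_run_in_parallel(Consistency::Edge, g, 0, 2) << "\n";  // 1: not adjacent
    std::cout << can_run_in_parallel(Consistency::Full, g, 0, 2) << "\n";  // 0: share neighbor 1
    return 0;
}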

Consistency model: implementation.

Option 1: Chromatic engine. Graph coloring ensures that neighboring vertices have different colors, and only vertices with the same color are updated simultaneously. Execution proceeds color by color (Phase 1, Phase 2, Phase 3, ...): the vertices of one color are executed in parallel synchronously, followed by a barrier before the next color begins.

Option 2: Consistency using distributed locks. (Figure: the data graph is partitioned across Node 1 and Node 2.) Lock pipelining: non-blocking locks allow computation to proceed while locks and data are being requested, and locks are requested in a canonical order to avoid deadlock. (A small sketch of canonical-order locking follows.)
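A minimal C++ sketch of the canonical-ordering rule mentioned above (a generic illustration using std::mutex on a single machine, not GraphLab's distributed lock implementation): every update acquires the locks of its scope in increasing vertex-ID order, so two updates can never hold locks in opposite orders, and no deadlock can form.

#include <algorithm>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::vector<std::mutex> vertex_locks(6);  // one lock per vertex

// Acquire the locks for a set of vertices in canonical (sorted) order.
// Because every thread uses the same global order, no cycle of waiting
// threads can form, which rules out deadlock.
void lock_scope(std::vector<int> scope) {
    std::sort(scope.begin(), scope.end());
    for (int v : scope) vertex_locks[v].lock();
}

void unlock_scope(const std::vector<int>& scope) {
    for (int v : scope) vertex_locks[v].unlock();
}

void update(int center, const std::vector<int>& scope) {
    lock_scope(scope);
    std::cout << "updating vertex " << center << "\n";
    unlock_scope(scope);
}

int main() {
    // Two updates whose scopes overlap on vertices 2 and 3; with canonical
    // ordering they serialize safely instead of deadlocking.
    std::thread t1(update, 2, std::vector<int>{1, 2, 3});
    std::thread t2(update, 3, std::vector<int>{3, 2, 4});
    t1.join();
    t2.join();
    return 0;
}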