
MapReduce, from the paper "MapReduce: Simplified Data Processing on Large Clusters" (2004)

What it is

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

map and reduce functions

map(String key, String value)
reduce(String key, Iterator values)

map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)
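A minimal sketch of these signatures as Python type hints (the alias names MapFn and ReduceFn are illustrative, not from the paper):

    from typing import Callable, Iterable, TypeVar

    K1 = TypeVar("K1"); V1 = TypeVar("V1")  # input key/value types
    K2 = TypeVar("K2"); V2 = TypeVar("V2")  # intermediate key/value types

    # map (k1, v1) -> list(k2, v2)
    MapFn = Callable[[K1, V1], Iterable[tuple[K2, V2]]]

    # reduce (k2, list(v2)) -> list(v2)
    ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]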

sample problem: word counting

counting the number of occurrences of each word in a large collection of documents

map(String key, String value):
  // key: document name (k1 = filename)
  // value: document contents (v1)
  for each word w (= k2) in value:
    EmitIntermediate(w, "1" (= v2));

sample problem: word counting

reduce(String key, Iterator values):
  // key: a word (k2)
  // values: list of v2 = list of "1"
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
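To make the example concrete, here is a self-contained single-machine sketch; run_mapreduce is a hypothetical in-memory stand-in for the MapReduce library (no distribution, the shuffle is just a dict), not the paper's implementation:

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # toy in-memory MapReduce: the shuffle/sort phase is a dict
        intermediate = defaultdict(list)
        for k1, v1 in inputs:                          # map phase
            for k2, v2 in map_fn(k1, v1):
                intermediate[k2].append(v2)
        return {k2: list(reduce_fn(k2, values))        # reduce phase
                for k2, values in sorted(intermediate.items())}

    def wc_map(filename, contents):
        # key: document name (k1), value: document contents (v1)
        for word in contents.split():
            yield word, "1"                            # EmitIntermediate(w, "1")

    def wc_reduce(word, values):
        # values: the list of "1" strings emitted for this word
        yield str(sum(int(v) for v in values))         # Emit(AsString(result))

    docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog")]
    print(run_mapreduce(docs, wc_map, wc_reduce))
    # {'dog': ['1'], 'fox': ['1'], 'lazy': ['1'], 'quick': ['1'], 'the': ['2']}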

sample problem: distributed grep

grep (pattern matching against a regular expression) over a large collection of documents

map: k1 = filename; k2 = line that matches the pattern; v2 = filename + line number within the file
reduce: identity; for each k2, output list(v2)
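With the same toy run_mapreduce helper from the word-count sketch, the grep job might look like this (the pattern and the v2 layout are illustrative assumptions):

    import re

    PATTERN = re.compile(r"error")                     # illustrative pattern

    def grep_map(filename, contents):
        for lineno, line in enumerate(contents.splitlines(), start=1):
            if PATTERN.search(line):
                # k2 = the matching line, v2 = filename + line number
                yield line, f"{filename}:{lineno}"

    def grep_reduce(line, locations):
        yield from locations                           # identity reduce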

sample: URL access frequency

counting the number of occurrences of each URL in a large collection of documents that contain URL lists

map: k1 = filename; k2 = URL; v2 = 1
reduce: for each k2, output size(list(v2))
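A sketch of the same job in Python, reusing the toy helper above (assuming each document is a whitespace-separated list of URLs, as on the slide):

    def url_map(filename, contents):
        for url in contents.split():
            yield url, 1                               # emit <URL, 1>

    def url_reduce(url, counts):
        yield sum(counts)                              # size(list(v2)) = total accesses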

sample: reverse web link graph

finding all references to a Web page, given a large collection of documents that contain <source-url, target-url> pairs

map: k1 = filename; k2 = target-url; v2 = source-url
reduce: for each k2, output list(v2)
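A sketch in the same style (here the map input value is assumed to be already-parsed (source-url, target-url) pairs):

    def link_map(filename, pairs):
        # pairs: the <source-url, target-url> links found in the input
        for source, target in pairs:
            yield target, source                       # emit <target, source>

    def link_reduce(target, sources):
        yield list(sources)                            # <target, list(source)>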

execution of a map reduce process

execution of a map reduce process

1. The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
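The paper gives hash(key) mod R as the default partitioning function; a one-line sketch of how a map worker picks the region for an intermediate pair (Python's built-in hash stands in for the real hash function):

    R = 4                                              # number of reduce tasks (illustrative)

    def partition(key, r=R):
        # default partitioning function from the paper: hash(key) mod R
        return hash(key) % r

    # a map worker appends each buffered pair (k2, v2) to local region file partition(k2)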

execution of a map reduce process (continued)

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. After successful completion, the output of the MapReduce execution is available in the R output files.
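Steps 5 and 6 amount to sort-then-group; a minimal in-memory sketch (a real reduce worker would fall back to an external sort when the data does not fit in memory):

    from itertools import groupby
    from operator import itemgetter

    def reduce_worker(pairs, reduce_fn):
        # pairs: list of all intermediate (k2, v2) pairs fetched from the map workers
        pairs.sort(key=itemgetter(0))                  # sort by intermediate key
        for k2, group in groupby(pairs, key=itemgetter(0)):
            values = (v for _, v in group)
            for out in reduce_fn(k2, values):
                yield k2, out                          # appended to this partition's output file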

master data structures

For each map task and reduce task, the master stores:
- the state (idle, in-progress, or completed)
- the identity of the worker machine (for non-idle tasks)

For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task.
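These structures might be sketched as Python dataclasses (the field names are assumptions for illustration):

    from dataclasses import dataclass, field
    from enum import Enum

    class State(Enum):
        IDLE = "idle"
        IN_PROGRESS = "in-progress"
        COMPLETED = "completed"

    @dataclass
    class Task:
        state: State = State.IDLE
        worker: str | None = None                      # machine identity, for non-idle tasks

    @dataclass
    class MapTask(Task):
        # for completed map tasks: location and size of each of the R region files
        region_files: list[tuple[str, int]] = field(default_factory=list)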

fault tolerance (1)

worker failure: the master pings every worker periodically; if no response is received from a worker within a certain amount of time, the master marks the worker as failed. When a map task is executed first by worker A and then later re-executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read the data from worker B.

master failure: the MapReduce computation is aborted if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.
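A hedged sketch of the ping/timeout check (the Worker fields and the 60-second timeout are illustrative assumptions, not values from the paper):

    import time
    from dataclasses import dataclass, field

    TIMEOUT = 60.0   # seconds without a response before a worker is marked failed (assumed)

    @dataclass
    class Worker:
        last_pong: float = field(default_factory=time.monotonic)
        failed: bool = False
        completed_map_task_ids: list = field(default_factory=list)

    def check_workers(workers, idle_queue):
        now = time.monotonic()
        for w in workers.values():
            if not w.failed and now - w.last_pong > TIMEOUT:
                w.failed = True
                # a failed machine's completed map output lives on its local disk,
                # so those map tasks are reset to idle and re-executed elsewhere
                idle_queue.extend(w.completed_map_task_ids)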

fault tolerance (2)

straggling workers: a machine takes an unusually long time to complete one of the last few map or reduce tasks in the computation. General mechanism to alleviate the problem of stragglers: when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.

fault tolerance (3)

bad records: sometimes there are bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records. Sometimes it is acceptable to ignore a few records, for example when doing statistical analysis on a large data set. Each worker process installs a signal handler that catches segmentation violations and bus errors. When the master has seen more than one failure on a particular record, it indicates that the record should be skipped when it issues the next re-execution of the corresponding Map or Reduce task.
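A single-process sketch of the skipping logic; Python exceptions stand in for the paper's signal handler (which catches segmentation violations and bus errors and reports a "last gasp" sequence number to the master):

    from collections import Counter

    MAX_FAILURES = 1                       # paper: skip after more than one failure
    failure_counts = Counter()             # per-record failure tally (kept by the master)

    def run_task(records, user_fn):
        for seq, record in enumerate(records):
            if failure_counts[seq] > MAX_FAILURES:
                continue                   # master marked this record: skip it
            try:
                user_fn(record)
            except Exception:              # stand-in for the SIGSEGV/SIGBUS handler
                failure_counts[seq] += 1   # report the offending record to the master
                raise                      # the worker dies; the task is re-executed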

optimizations: combiner function

In some cases there is significant repetition in the intermediate keys produced by each map task, and the user-specified Reduce function is commutative and associative; the intermediate values can then be partially merged locally. The combiner function is the same code as the reduce function, except for the output destination (the combiner writes to an intermediate file). The combiner function is executed on each machine that performs a map task.
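For word count this means summing locally before anything hits the network; a sketch reusing wc_reduce from the earlier example as the combiner:

    from collections import defaultdict

    def combine(map_output, combiner_fn):
        # apply the combiner to one map task's buffered output, locally
        local = defaultdict(list)
        for k2, v2 in map_output:
            local[k2].append(v2)
        for k2, values in local.items():
            for v in combiner_fn(k2, values):      # e.g. wc_reduce
                yield k2, v

    # e.g. 1000 local ("the", "1") pairs collapse into a single ("the", "1000")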

sample: distributed grep

The grep program scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into approximately 64 MB pieces (M = 15,000; 10^10 records × 100 bytes ≈ 1 TB, and 1 TB / 64 MB ≈ 15,000 pieces), and the entire output is placed in one file (R = 1).

usage statistics 2004-2007 http://www.niallkennedy.com/blog/2008/01/google-mapreduce-stats.html