
MapReduce: Jeffrey Dean and Sanjay Ghemawat

Background context
- BIG DATA!!
  o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc.
  o Very useful to be able to crunch through that data (mining / processing)
- But doing so efficiently requires a large-scale, parallel data processing system
  o How long does it take to go through 1 TB sequentially? Roughly 3.5 hours at 80 MB/s (1,000,000 MB / 80 MB/s is about 12,500 s). What if you have to do multiple scans or phases? (see the back-of-envelope sketch after this section)
  o Parallel processing introduces a host of nasty distribution issues
    - Communication and routing: which nodes should be involved? what transport protocol should be used? how do you manage threads / events / connections? remote execution of your processing code?
    - Fault tolerance and fault detection; more broadly, group membership
    - Load balancing / partitioning of data: heterogeneity of nodes, skew in the data, network topology and bisection bandwidth issues
  o Also requires you to think about a parallelization strategy: how do you split the work? Algorithmic issues.

Goal of work
- To solve these distribution issues once, in a reusable library
  o To shield the programmer from having to resolve them each time they write a data analysis program
- To obtain adequate throughput and scalability out of the system
- To provide the programmer with a useful conceptual framework for designing their parallel algorithm
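A back-of-envelope sketch of the sequential-scan argument above. The 80 MB/s per-disk rate is the figure quoted in the notes; the 100-node case is an illustrative assumption.

    # Back-of-envelope: time to scan 1 TB sequentially vs. split across nodes.
    TB_IN_MB = 1_000_000
    RATE_MB_S = 80  # per-disk sequential read rate assumed in the notes

    def scan_hours(nodes: int) -> float:
        """Hours to scan 1 TB if the data is split evenly across `nodes` disks."""
        return TB_IN_MB / (RATE_MB_S * nodes) / 3600

    print(f"1 node:    {scan_hours(1):.1f} h")            # ~3.5 h
    print(f"100 nodes: {scan_hours(100) * 60:.1f} min")   # ~2.1 min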

Map/reduce
- Two phases, conceptually, with a hidden intermediate shuffle phase
- Map
  o For each key/value pair in the input file, the map() function is invoked
    - the set of input records is split into M different map splits
    - the splits are processed in parallel on different worker invocations; each worker serially iterates through its split
    - each map() produces a set of intermediate key/value pairs as output into a local file on the map worker, bucketed by the reduce partitioning function into R buckets, i.e., one bucket per reduce task
- Shuffle
  o for each reduce bucket, a reduce worker is identified; that worker slurps its bucket from every map worker's local files
  o the reduce worker then sorts by key and groups the values per key, to get the (k2, list(v2)) form
- Reduce
  o For each key in the intermediate output, the reduce() function is invoked
    - the invocation accepts the key and a sorted list of values
    - reduce() does some kind of merge or filtering, and emits a list of values to an output file

Examples (a runnable sketch of the map/shuffle/reduce flow appears after this list)
- sort
  o map(linenum, line) { emit(sortkey(line), line) }
  o reduce(sortkey, list(line)) { emit(list(line)) }
- inverted index
  o map(docid, doctext) { foreach word in doctext { emit(word, docid) } }
  o reduce(word, list(docid)) { emit(word, sort(list(docid))); }
- grep
  o map(linenum, line) { if (grep(line, pattern)) emit(linenum, line) }
  o reduce(linenum, list(line)) { emit(list(line)) }
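A minimal, single-process sketch of the flow above, using the inverted-index example. The function names and the in-memory "shuffle" are illustrative stand-ins, not the paper's API; the real system buckets by a partition function, writes local files, and sorts per reduce task.

    from collections import defaultdict

    # map(): emit (word, docid) for every word in the document
    def map_fn(docid, doctext):
        for word in doctext.split():
            yield (word, docid)

    # reduce(): emit the word with a sorted list of docids that contain it
    def reduce_fn(word, docids):
        return (word, sorted(set(docids)))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # "shuffle": group all intermediate values by key
        groups = defaultdict(list)
        for key, value in inputs:
            for k2, v2 in map_fn(key, value):
                groups[k2].append(v2)
        return [reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())]

    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog"), ("d3", "the fox")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', ['d1']), ('dog', ['d2']), ('fox', ['d1', 'd3']), ...]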

Join example: much harder (see "A Comparison of Join Algorithms for Log Processing in MapReduce")
- the standard approach is the repartition join, aka partitioned sort-merge in database parlance (sketched after this list)
  o imagine you have a log table L and some reference table R that contains user information, with L >> R
  o you want an equijoin of L with R, where L.k = R.k
  o implement it as a single MR job; the input is both L and R, and each map task works on a split of either L or R
  o map task: tags each record with its originating table, and outputs (join key, tagged record) as the key/value pair
  o shuffle phase will partition on the join key, aggregate the join-key partitions, and sort
  o reduce task: for each join key, separate and buffer the input records into two sets according to the tag, then do a cross-product between the two sets
  o problems:
    - both L and R end up having to be sorted by the shuffle phase
    - both L and R end up having to be sent over the network during the shuffle phase
    - the reduce task might not be able to keep its entire input in memory, for skewed inputs
- can do a broadcast join instead: if R is tiny, replicate R on each map worker, build an in-memory hash table of R in each worker, then hash-join against L in the map phase
- several others as well; which is best depends on a bunch of things: the rest of the MR job, data selectivity and skew, etc.
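A sketch of the reduce-side repartition join described above, again as a single-process simulation. The tagging scheme, record layouts, and helper names are assumptions for illustration, not interfaces from the paper.

    from collections import defaultdict
    from itertools import product

    # Map tasks: tag each record with its originating table, keyed by join key.
    # Here L records are (user_id, url) log entries and R records are (user_id, name).
    def map_L(record):
        user_id, url = record
        yield (user_id, ("L", url))

    def map_R(record):
        user_id, name = record
        yield (user_id, ("R", name))

    def reduce_join(join_key, tagged_records):
        # Separate the buffered records by tag, then cross-product the two sets.
        l_side = [v for tag, v in tagged_records if tag == "L"]
        r_side = [v for tag, v in tagged_records if tag == "R"]
        for url, name in product(l_side, r_side):
            yield (join_key, url, name)

    L = [(1, "/a"), (1, "/b"), (2, "/c")]
    R = [(1, "alice"), (2, "bob")]

    # "shuffle": group the tagged intermediate records by join key
    groups = defaultdict(list)
    for rec in L:
        for k, v in map_L(rec):
            groups[k].append(v)
    for rec in R:
        for k, v in map_R(rec):
            groups[k].append(v)

    for k in sorted(groups):
        print(list(reduce_join(k, groups[k])))
    # [(1, '/a', 'alice'), (1, '/b', 'alice')] then [(2, '/c', 'bob')]

Note that in the real system both L and R cross the network and get sorted in the shuffle, which is exactly the cost the notes call out.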

Semantic issues
- In what order are the input values consumed?
  o Order is determined by a split configuration value in the user program
  o The right way to think about it is as unordered: consumption order shouldn't matter for the correctness of your map/reduce program, but order could affect the performance of the shuffle phase in the middle
- In what order are the intermediate values consumed?
  o This matters more: you might want the final output to be grouped across reduce keys in some additional way, so you need to control the partitioning of the intermediate files across reduce tasks
  o The partition function is specifiable by the mapreduce program (with a default that does hash partitioning)
- Should map or reduce have side-effects?
  o It's up to you, but...
  o The implementation does not guarantee exactly-once semantics; it provides at-least-once (except for deterministic bugs)
    - implies side-effects must be idempotent for deterministic output
    - failure implies side-effects should be atomic for correct output
- Should map and reduce functions be deterministic?
  o It's up to you, but for the same reason, non-determinism confuses the semantics
  o If deterministic, worker restart plus atomic rename to commit output guarantees the execution is equivalent to some sequential execution
  o If non-deterministic, you get a blend of sequential executions (e.g., timestamps or counters go zooey)

What mapreduce is built on
- Underlying GFS. Why?
  o Single namespace for input and output files; simplifies initial data distribution
  o Why not use GFS for the intermediate files?? Probably because they are transient and do not need to be globally readable; more efficient to produce them locally and special-purpose the RPC slurp to the reduce tasks
- Cluster management software
  o Failures
  o Group membership
  o Load balancing and scheduling

Tweaks / improvements
- Combiner: basically a reduce run locally on the output of a map (sketched after this list)
- I/O types: marshall/unmarshall libraries for parsing input files
- Skip bad records: handles bad input that deterministically crashes the user function
- Status info: HTTP server with stats on progress, stdin/stdout of tasks, etc.
- Counters: propagated to the master, included in the HTTP server output
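A sketch of what the combiner buys you, using word count as the illustration (word count is a standard example, not one from these notes). Because summation is associative and commutative, the same function serves as both combiner and reducer, and running it on each map task's local output shrinks what has to be written, shipped, and sorted in the shuffle.

    from collections import defaultdict

    def map_wordcount(_docid, text):
        for word in text.split():
            yield (word, 1)

    def combine_or_reduce(word, counts):
        # Same associative/commutative function used as combiner and reducer.
        return (word, sum(counts))

    def map_side_output(docid, text, use_combiner):
        local = defaultdict(list)
        for k, v in map_wordcount(docid, text):
            local[k].append(v)
        if not use_combiner:
            return [(k, v) for k, vs in local.items() for v in vs]
        # Combiner: pre-aggregate locally, so fewer pairs cross the network.
        return [combine_or_reduce(k, vs) for k, vs in local.items()]

    doc = ("d1", "to be or not to be to be")
    print(len(map_side_output(*doc, use_combiner=False)))  # 8 intermediate pairs
    print(len(map_side_output(*doc, use_combiner=True)))   # 4 intermediate pairs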

Comparison to databases
- Huge source of controversy in the DB community
  o misconceptions in the systems community about the degree of contribution of MR
  o parallel databases have much more advanced data processing support that leads to much, much more efficient processing
  o parallel databases support a much richer semantic model (arbitrary SQL, including joins)
  o paper by Stonebraker, DeWitt, and others: "A Comparison of Approaches to Large-Scale Data Analysis"
- Parallel databases support a schema; MR does not
  o the structure of the data in MR must be built into the map() and reduce() programs; this hinders data sharing across programs and programmers
  o if sharing is done, there is an implicit agreement on a data model; might as well codify it in a schema
  o lack of a schema means lack of an automated mechanism to enforce integrity constraints / typing
- Parallel databases support indexes; MR does not
  o selection is accelerated massively with an index, including range queries (or other indexable subset operators)
  o no index at all in the MR framework; if programmers want one, they have to implement the index themselves, and also enhance the data-fetching mechanism in MR to use it; means it's hard to share indexes!
- Parallel databases support the full relational programming model (SQL); MR forces programmers into a more assembly-like model
  o e.g., you have to implement your own join strategy if you want one in MR; SQL supports joins naturally
  o it is the difference between stating what you want (a declarative SQL query) and spelling out how to compute it (an MR program)
- Query optimization
  o parallel databases use the existence of a schema, indices, and the declared query, as well as knowledge of how the data is partitioned and where it lives across the cluster, to support full query optimization: they calculate the most efficient query plan
  o MR forces programmers to do this themselves, if they do it at all
    - e.g., selectivity: try to push the least selective (i.e., most pruning) selection operator early, to minimize how much data has to flow between query operators (see the sketch after this section)
    - MR cannot do this automatically, and as a result massive intermediate data sets end up materializing locally in files and being shipped / sorted in the shuffle phase
    - another example: when the reduce phase of MR starts, all the disks of the map workers are hammered by the pull of these materialized local files; databases use push to avoid this, and avoid materializing files to disk where possible
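A sketch of what "pushing selection early" means when it has to be done by hand in MR: the filter below is applied inside map(), so pruned records never reach the intermediate files or the shuffle. The record layout and predicate are hypothetical, for illustration only.

    # Pushing a selection into the map phase by hand, as a query optimizer
    # would do automatically. Only records that survive the filter are emitted,
    # so only they get written to intermediate files, shuffled, and sorted.

    def selective_map(record):
        user_id, country, url = record
        if country != "DE":          # hypothetical selection predicate
            return                   # pruned here; never reaches the shuffle
        yield (user_id, url)

    log = [(1, "DE", "/a"), (2, "US", "/b"), (3, "DE", "/c"), (4, "FR", "/d")]
    intermediate = [kv for rec in log for kv in selective_map(rec)]
    print(intermediate)              # [(1, '/a'), (3, '/c')]: 2 of 4 records shuffled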

Where does MR win?
- Loading data into the system: easier to bulk-load files into GFS than to pump records into a DB, and current DBs are bad at parallelizing the load phase
- Fault tolerance: the unit of work (a map or reduce task) can fail and be restarted in MR. A DB has fault tolerance too, but the unit of restart is the entire query, in part because intermediate results are not materialized on disk.
  o implication? as you scale up to larger and larger clusters, it becomes more likely that a query will need to be restarted because of a failure during execution
  o Stonebraker paper: the efficiency gains of a DB mean that in practice you execute on much smaller clusters, so this is a non-issue except for a very small number of players in the world
- Approachability: more people are familiar with coding C++ than with writing SQL

Where do neither win?
- Low-latency, incremental calculations (as opposed to high-throughput, batch-oriented programs)
  o Google abandoned MR for its indexing operation in favor of Percolator, a system that can efficiently and transactionally update the existing index as new pages are crawled. Before Percolator, the entire index had to be recalculated in a giant batch, so updates were delayed for weeks while new data accumulated for the next batch.

Evaluation
- What questions are you curious about?
  o What is their performance bottleneck for typical map/reduces? And what performance optimizations does this imply are possible?
  o How well utilized are the physical machines during execution?
  o What is the scaling bottleneck in the system?
  o What are the effects of different scheduling policies on throughput? What happens if you schedule multiple mapreduces concurrently?
- What questions did they answer?
  o How fast does it run on two programs? (grep, sort) Fast. I think.

GREP
- Things to note from this graph:
  o exponential ramp-up in rate
    - why? hard to tell; an artifact of their scheduler. My guess is that they took inspiration from slow-start to do congestion-based growth. Is there an equivalent of packet loss and backoff? Yes: two tasks co-scheduled on the same node.
  o peak input rate of 30,000 MB/s
    - over 1764 nodes, i.e., about 17 MB/s per node; pretty good sequential throughput!
    - why do they get this? GFS property.
  o heavy tail to the right
    - waiting for stragglers
    - very careful scheduling is needed to avoid this
    - but they care about throughput, not latency, so who cares