Big Data Processing with Google's MapReduce. Alexandru Costan


Outline: Motivation, MapReduce programming model, Examples, MapReduce system architecture, Limitations, Extensions

Motivation. Big Data at Google: 20+ billion web pages x 20 KB = 400+ TB. One computer can read 30-35 MB/sec from disk, so it would take about 4 months just to read the web (400 TB at ~33 MB/s is roughly 1.2 x 10^7 seconds) and about 1,000 hard drives just to store it. Doing something useful with the data (data processing, data analytics) takes even more time and more drives.

Solution: spread the work over many machines. Good news: the parallelization is easy; reading the web with 1,000 machines takes less than 3 hours. Bad news: a lot of programming work is needed for communication and coordination, debugging, fault tolerance, management and monitoring, and optimization. Worse news: all of this has to be repeated for every problem.

The size keeps increasing. Every Google service sees continuous growth in computational needs: more queries (more users, happier users), more data (bigger web, mailbox, blog, etc.), and better results (find the right information, and find it faster).

Typical computer at Google: a multicore machine with 1-2 TB of disk and 4-16 GB of RAM. A typical machine runs the Google File System (GFS), a scheduler daemon for starting user tasks, and one or more user tasks. There are tens of thousands of such machines. Problem: what programming model to use as a basis for scalable parallel processing?

What is needed? A simple programming model that applies to many data-intensive computing problems. Approach: hide the messy details in a runtime library: automatic parallelization, load balancing, network and disk transfer optimization, handling of machine failures, robustness. Improvements to the core library benefit all of its users.

Such a model is MapReduce. The typical problem solved by MapReduce: read a lot of data; Map: extract something interesting from each record; shuffle and sort; Reduce: aggregate, summarize, filter or transform; write the results. The outline stays the same; only map and reduce change to fit the problem.

MapReduce at a glance (figure)

More specifically: MapReduce is inspired by the map and reduce functions of functional programming. Users implement the interface of two primary functions: map(k, v) -> <k', v'>* and reduce(k', <v'>*) -> <k', v''>*. All v' with the same k' are reduced together, and processed in v' order.
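As an illustration only (not Google's actual implementation), the whole data flow can be mimicked in a few lines of Python; run_mapreduce, map_fn and reduce_fn are made-up names for this sketch:

    from itertools import groupby
    from operator import itemgetter

    def run_mapreduce(records, map_fn, reduce_fn):
        # Toy single-process illustration of the MapReduce data flow:
        # map -> shuffle/sort (group by intermediate key) -> reduce.
        intermediate = []
        for key, value in records:
            intermediate.extend(map_fn(key, value))       # map: emit (k', v') pairs
        intermediate.sort(key=itemgetter(0))              # shuffle and sort by k'
        output = []
        for k, group in groupby(intermediate, key=itemgetter(0)):
            output.extend(reduce_fn(k, [v for _, v in group]))  # reduce per key
        return output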

Example 1: word count (figure)
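The word-count figure is not reproduced in this transcription; a minimal sketch of what the two user functions might look like, using the toy run_mapreduce runner above (all names are illustrative, not a real framework's API):

    def wc_map(filename, line):
        # map(k, v): emit (word, 1) for every word in the line
        return [(word, 1) for word in line.split()]

    def wc_reduce(word, counts):
        # reduce(k', <v'>*): sum the partial counts for this word
        return [(word, sum(counts))]

    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_mapreduce(docs, wc_map, wc_reduce))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]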

Example 2: word length count (figure, shown step by step on the original slides)
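Again as a sketch only, reusing run_mapreduce and docs from the word-count example above, the key simply becomes the word length instead of the word itself:

    def wl_map(filename, line):
        # emit (word length, 1) for every word
        return [(len(word), 1) for word in line.split()]

    def wl_reduce(length, counts):
        return [(length, sum(counts))]

    print(run_mapreduce(docs, wl_map, wl_reduce))
    # [(3, 4), (4, 1), (5, 2)]  -- four 3-letter words, one 4-letter, two 5-letter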

Zoom on the Map phase (figure: Input -> Map -> Shuffle -> Reduce -> Output). map(k, v) -> <k', v'>*. Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g., (filename, line).

Combiner. For certain types of reduce functions (commutative and associative ones such as SUM, COUNT, MAX, MIN), one can decrease the communication cost by running the reduce function within the mappers. Example, word count. Without a combiner, <docid, {list of words}> produces c records <word, 1>; with a combiner it produces a single <word, c>, where c is the number of times the word appears in that mapper's input.
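For word count, the combiner amounts to running the summing logic inside each mapper. A hedged sketch, reusing the toy runner above (wc_map_combined is an illustrative name); wc_reduce still works unchanged, because summing partial counts gives the same total:

    from collections import Counter

    def wc_map_combined(filename, line):
        # Run the (commutative, associative) reduce logic inside the mapper:
        # emit one (word, c) pair instead of c copies of (word, 1).
        return list(Counter(line.split()).items())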

Zoom on the Shuffle phase (figure: Input -> Map -> Shuffle -> Reduce -> Output). After the map phase is over, all the intermediate values for a given output key are combined together into a list.

Zoom on the Reduce phase (figure: Input -> Map -> Shuffle -> Reduce -> Output). reduce(k', <v'>*) -> <k', v''>*. reduce() combines those intermediate values into one or more final values per key (usually only one).

System architecture: one master, many workers. The master partitions the input file into M splits, by key, and assigns workers (= servers) to the M map tasks, keeping track of their progress. Map workers write their output to local disk, partitioned into R regions. The master then assigns workers to the R reduce tasks; reduce workers read the regions from the map workers' local disks. Often: 1 split/chunk = 64 MB, M = 200,000, R = 4,000, workers = 2,000.

Architectural overview of Google MapReduce (figure)

Scheduling - Map. The master assigns each map task to a free worker, considering the locality of the data to the worker when assigning the task. The worker reads the task input (often from local disk) and applies the map function to each record in the split. The worker produces R local files/partitions containing intermediate key/value pairs, using a partition function, e.g., hash(key) mod R.
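A minimal sketch of such a partition function (illustrative only; CRC32 is chosen here simply because, unlike Python's built-in hash(), it is stable across worker processes, which matters when every mapper must send a given key to the same region):

    import zlib

    def partition(key, R):
        # The slide's hash(key) mod R, with a deterministic hash function.
        return zlib.crc32(str(key).encode("utf-8")) % R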

Scheduling - Reduce. The master assigns each reduce task to a free worker. The i-th reduce worker reads the i-th partition output by each map task using remote procedure calls. The data is sorted by key so that all occurrences of the same key are close to each other. The reducer iterates over the sorted data and passes all records with the same key to the user-defined reduce function.
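A rough single-process sketch of what one reduce worker does, assuming map_outputs[m][i] stands for the i-th region written by map task m (in the real system this data is fetched over RPC, not passed as in-memory lists):

    from itertools import groupby
    from operator import itemgetter

    def reduce_worker(i, map_outputs, reduce_fn):
        # Gather the i-th region of every map task's output.
        pairs = [kv for regions in map_outputs for kv in regions[i]]
        pairs.sort(key=itemgetter(0))          # sort so equal keys are adjacent
        result = []
        for k, group in groupby(pairs, key=itemgetter(0)):
            result.extend(reduce_fn(k, [v for _, v in group]))
        return result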

Features. Exploit parallelization: tasks are executed in parallel. Fault tolerance: re-execute only the tasks on failed machines. Exploit data locality: co-locate data and computation to avoid the network bottleneck.

Parallelism. map() functions run in parallel, creating different intermediate values from different input data sets. reduce() functions also run in parallel, each working on a different output key. All values are processed independently. Bottleneck: the reduce phase can't start until the map phase is completely finished.

Fault tolerance. The master detects worker failures by pinging workers periodically. If a worker is down, its tasks are reassigned to another worker: both completed and in-progress map() tasks are re-executed (their output lives on the failed machine's local disk), while only in-progress reduce() tasks are re-executed. The master also notices particular input key/values that cause crashes in map(), and skips those values on re-execution.

Fault tolerance: backup tasks. A straggler is a machine that takes an unusually long time to complete one of the last tasks, e.g. because a bad disk forces frequent correctable errors (throughput drops from 30 MB/s to 1 MB/s) or because the cluster scheduler has scheduled other tasks on that machine. Stragglers are a main cause of slowdown. Solution: pre-emptive backup execution of the last few remaining in-progress tasks.

Widely used at Google: distributed grep, distributed sort, term-vector per host, document clustering, machine learning, web access log stats, web link-graph reversal, inverted index construction, statistical machine translation.

Many implementations (figure)

MapReduce limitations. Not efficient for real-time processing. Very limited queries: it is difficult to write more complex tasks, which require multiple map-reduce operations (a solution: declarative query languages). No support for iterative processing. There is a barrier between Map and Reduce.

MapReduce extensions: supporting iterative processing; supporting pipelined / reduce-intensive workloads.

Supporting iterative processing. MapReduce can't express recursion/iteration, yet lots of interesting programs need loops: graph algorithms, clustering, machine learning, recursive queries. The dominant solution is to use a driver program outside of MapReduce (as sketched below). Hypothesis: making MapReduce loop-aware affords optimizations and scalable implementations of recursive languages.
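A sketch of such a driver loop; run_job is a hypothetical stand-in for submitting one MapReduce job and collecting its output, not a real API:

    def iterative_driver(run_job, initial_input, max_iters=10):
        # The driver just loops, launching one job per iteration,
        # until a fixpoint is reached or max_iters jobs have run.
        current = initial_input
        for _ in range(max_iters):
            result = run_job(current)
            if result == current:      # fixpoint reached
                break
            current = result
        return current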

Supporting iterative processing. Cache the loop-invariant data in the first iteration and reuse it in later iterations. Cache the reducer outputs, which makes checking for a fixpoint more efficient, without an extra MapReduce job. Examples: Twister, HaLoop, iMapReduce.

Pipelined MapReduce. The reducers can begin processing the data as soon as it is produced by the mappers. MapReduce jobs run continuously, accepting new data as it arrives and analyzing it immediately: continuous queries, event monitoring and stream processing. Pipelining delivers data to downstream operators more promptly, which increases parallelism, improves utilization and reduces response time.
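A very rough single-process analogy of the idea (not how a real pipelined MapReduce engine is implemented): the aggregate is updated and exposed as each record arrives, rather than after a full map phase.

    from collections import Counter

    def pipelined_word_count(lines):
        # Online aggregation: keep running totals and yield a refined
        # answer after every input record.
        running = Counter()
        for line in lines:              # lines may be an unbounded stream
            running.update(line.split())
            yield dict(running)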

An example of a reduction tree (figure)

MapReduce summary. It hides scheduling and parallelization details. It is simple to program: only Map and Reduce need to be written. It is efficient for batch processing, but not for real-time processing. Several extensions exist for iterative and pipelined processing. Additional reading: The Family of MapReduce and Large-Scale Data Processing Systems.