MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.





Motivation: Large-Scale Data Processing
Many tasks: process lots of data to produce other data.
Want to use hundreds or thousands of CPUs... but this needs to be easy.
MapReduce provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring

Model Is Widely Applicable
(Chart: MapReduce programs in the Google source tree.)

Programming Model
Input and output: each a set of key/value pairs.
Programmer specifies two functions:
  map(in_key, in_value) -> list(out_key, intermediate_value)
  reduce(out_key, list(intermediate_value)) -> list(out_value)
Inspired by similar primitives in Lisp and other languages.
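To make the two signatures concrete, here is a minimal single-machine sketch of the model applied to word counting. The "framework" is just a groupBy over the intermediate pairs; all names (MiniMapReduce, mapFn, reduceFn) are illustrative, not Google's actual API.

```scala
object MiniMapReduce {
  // map(in_key, in_value) -> list(out_key, intermediate_value)
  def mapFn(docId: String, text: String): List[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toList

  // reduce(out_key, list(intermediate_value)) -> list(out_value)
  def reduceFn(word: String, counts: List[Int]): Int = counts.sum

  // The framework's job: run map on every input pair, group the
  // intermediate pairs by key, then run reduce once per key.
  def run(docs: Map[String, String]): Map[String, Int] = {
    val intermediate = docs.toList.flatMap { case (k, v) => mapFn(k, v) }
    intermediate.groupBy(_._1).map { case (k, pairs) =>
      k -> reduceFn(k, pairs.map(_._2))
    }
  }
}

val counts = MiniMapReduce.run(Map(
  "doc1" -> "the quick brown fox",
  "doc2" -> "the lazy dog"
))
println(counts("the")) // 2
```

In the real system the grouping step is the distributed shuffle; everything else about the user-visible contract is captured by the two functions above.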

Execution

Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines.
- Minimizes time for fault recovery
- Can pipeline shuffling with map execution
- Better dynamic load balancing
Often use 200,000 map tasks and 5,000 reduce tasks with 2,000 machines.

Fault Tolerance: Handled via Re-execution
On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion committed through master
Master failure:
- Could handle, but don't yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine.
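The heartbeat-driven re-execution described above can be sketched as a master that tracks a last-heartbeat time per worker and re-queues the tasks of any worker that goes silent. All names and the 10-second timeout here are illustrative assumptions, not the real system's values.

```scala
case class Task(id: Int)

class Master(timeoutMs: Long) {
  private var lastBeat = Map.empty[String, Long]
  private var assigned = Map.empty[String, List[Task]]
  var pending: List[Task] = Nil // tasks waiting to be (re)scheduled

  def assign(worker: String, task: Task, now: Long): Unit = {
    assigned = assigned.updated(worker, task :: assigned.getOrElse(worker, Nil))
    lastBeat = lastBeat.updated(worker, now)
  }

  def heartbeat(worker: String, now: Long): Unit =
    lastBeat = lastBeat.updated(worker, now)

  // Re-queue every task assigned to a worker that missed its deadline.
  def checkFailures(now: Long): Unit =
    for ((w, t) <- lastBeat if now - t > timeoutMs) {
      pending = assigned.getOrElse(w, Nil) ::: pending
      assigned = assigned - w
      lastBeat = lastBeat - w
    }
}

val m = new Master(timeoutMs = 10000)
m.assign("worker-1", Task(1), now = 0)
m.assign("worker-2", Task(2), now = 0)
m.heartbeat("worker-2", now = 12000) // worker-1 stays silent
m.checkFailures(now = 12000)
println(m.pending) // List(Task(1))
```

The real master additionally distinguishes map from reduce tasks (completed map output lives on the failed worker's local disk, so completed map tasks must also be re-run, as the slide notes), but the detect-then-requeue loop is the core of the mechanism.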

Refinements
- Redundant execution
- Locality optimization
- Skipping bad records
- Sorting guarantees within each reduce partition
- Compression of intermediate data
- Combiner: useful for saving network bandwidth
- Local execution for debugging/testing
- User-defined counters
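The combiner refinement is worth a concrete look: it runs the reduce logic locally on each map task's output before the shuffle, which is valid whenever the reduce function is associative and commutative (as word-count's sum is). A minimal sketch, with illustrative names:

```scala
// Pre-aggregate one map task's output before it crosses the network.
def combine(mapOutput: List[(String, Int)]): List[(String, Int)] =
  mapOutput.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }.toList

// One map task emitted "the" three times and "fox" once:
val mapTaskOutput = List(("the", 1), ("fox", 1), ("the", 1), ("the", 1))
val shuffled = combine(mapTaskOutput)

// Four pairs would cross the network without a combiner; two with it.
println(shuffled.size) // 2
```

The final reduce still sums whatever arrives, so the answer is unchanged; only the volume of intermediate data shrinks.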

Experience: Rewrite of Production Indexing System
- Rewrote Google's production indexing system using MapReduce
- Set of 24 MapReduce operations
- New code is simpler, easier to understand
- MapReduce takes care of failures, slow machines
- Easy to make indexing faster by adding more machines

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley

Motivation
MapReduce greatly simplified "big data" analysis on large, unreliable clusters.
But as soon as it got popular, users wanted more:
- More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
- More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing).

Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing.
In MapReduce, the only way to share data across jobs is stable storage: slow!

Examples
Iterative job: input -> HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> HDFS write -> ...
Interactive queries: input -> HDFS read -> query 1, query 2, query 3 -> result 1, result 2, result 3
Slow due to replication and disk I/O, but necessary for fault tolerance.

Goal: In-Memory Data Sharing
Iterative job: input -> iter. 1 -> iter. 2 -> ...
Interactive queries: input -> one-time processing -> query 1, query 2, query 3
10-100x faster than network/disk, but how to get fault tolerance?

Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state:
- RAMCloud, databases, distributed shared memory, Piccolo
Requires replicating data or logs across nodes for fault tolerance:
- Costly for data-intensive apps
- 10-100x slower than memory write

Solution: Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory:
- Immutable, partitioned collections of records
- Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
Efficient fault recovery using lineage:
- Log one operation to apply to many elements
- Recompute lost partitions on failure
- No cost if nothing fails
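The lineage idea can be sketched in a few lines: instead of storing a dataset's contents, store the chain of coarse-grained transformations that produced it, so a lost result can always be recomputed from the base data. This is an illustrative toy (single machine, one "partition"), not Spark's implementation.

```scala
// Each node records only its parent and the operation applied to it.
sealed trait Lineage[A] { def compute(): Seq[A] }

case class Source[A](data: Seq[A]) extends Lineage[A] {
  def compute(): Seq[A] = data
}
case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}
case class Filtered[A](parent: Lineage[A], p: A => Boolean) extends Lineage[A] {
  def compute(): Seq[A] = parent.compute().filter(p)
}

val errors = Filtered(
  Mapped(Source(Seq("INFO ok", "ERROR boom")), (s: String) => s.toUpperCase),
  (s: String) => s.startsWith("ERROR"))

// If a cached result is lost, just walk the lineage and recompute:
println(errors.compute()) // List(ERROR BOOM)
```

Because each transformation applies one operation to many elements, the log stays tiny regardless of data size, which is exactly why lineage is cheaper than replicating the data itself.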

RDD Recovery
(Diagram: the same iterative and interactive workloads as above, with lost partitions recomputed from lineage.)

Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
- These naturally apply the same operation to many items
Unify many current programming models:
- Data flow models: MapReduce, Dryad, SQL, ...
- Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...
Support new apps that these models don't.

Tradeoff Space
(Chart: granularity of updates, fine to coarse, vs. write throughput, low to high. Fine-grained K-V stores, databases, and RAMCloud are best for transactional workloads; coarse-grained HDFS writes at network bandwidth, while RDDs write at memory bandwidth and are best for batch workloads.)

Spark Programming Interface
DryadLINQ-like API in the Scala language, usable interactively from the Scala interpreter.
Provides:
- Resilient distributed datasets (RDDs)
- Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
- Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

  lines = spark.textFile("hdfs://...")           // base RDD
  errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
  messages = errors.map(_.split("\t")(2))
  messages.persist()

  messages.filter(_.contains("foo")).count       // action
  messages.filter(_.contains("bar")).count

(Diagram: the master ships tasks to workers; each worker holds a block of the log and caches its partition of messages.)
Results: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split("\t")(2))

Lineage chain: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))

Fault Recovery Results
(Chart of iteration time in seconds over 10 iterations: most iterations take about 56-59 s; the first takes 119 s, and the iteration where a failure happens takes 81 s before times return to normal.)

Example: PageRank
1. Start each page with a rank of 1.
2. On each iteration, update each page's rank to the sum over its incoming neighbors i of rank_i / |neighbors_i|.

  links = // RDD of (url, neighbors) pairs
  ranks = // RDD of (url, rank) pairs

  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }.reduceByKey(_ + _)
  }
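The same update rule can be restated on plain Scala collections, with no cluster needed, to make the per-iteration data flow easy to check. The three-page graph below is an illustrative assumption; `flatMap` builds the contributions and the `groupBy`+sum plays the role of `reduceByKey`.

```scala
// Toy link graph: a links to b and c; b and c each link back to a.
val links: Map[String, Seq[String]] = Map(
  "a" -> Seq("b", "c"),
  "b" -> Seq("a"),
  "c" -> Seq("a")
)

// 1. Start each page with a rank of 1.
var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap

// 2. Each iteration: every page splits its rank among its out-links,
//    then each page's new rank is the sum of contributions it received.
for (_ <- 1 to 10) {
  val contribs = links.toSeq.flatMap { case (url, dests) =>
    dests.map(dest => (dest, ranks(url) / dests.size))
  }
  ranks = contribs.groupBy(_._1).map { case (u, cs) => u -> cs.map(_._2).sum }
}

println(ranks("a")) // 1.0
```

On this symmetric graph the ranks settle into a cycle whose even iterations give every page rank 1.0; the point is the shape of the computation, the repeated join-like pairing of links with ranks, which the next slide optimizes.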

Optimizing Placement
(Diagram: Links (url, neighbors) and Ranks_0 (url, rank) feed a join; each join produces Contribs_i, which reduce to Ranks_{i+1}, and so on.)
links and ranks are repeatedly joined. Can co-partition them (e.g. hash both on url) to avoid shuffles. Can also use app knowledge, e.g. hash on DNS name.

  links = links.partitionBy(new URLPartitioner())

PageRank Performance
(Chart, time per iteration: Hadoop 171 s; basic Spark 72 s; Spark with controlled partitioning 23 s.)

Implementation
Runs on Mesos [NSDI '11] to share clusters with Hadoop, MPI, etc. (diagram: Spark, Hadoop, and MPI running side by side on Mesos nodes).
Can read from any Hadoop input source (HDFS, S3, ...).
No changes to Scala language or compiler:
- Reflection + bytecode analysis to correctly ship code
www.spark-project.org

Programming Models Implemented on Spark
RDDs can express many existing parallel models:
- MapReduce, DryadLINQ
- Pregel graph processing [200 LOC]
- Iterative MapReduce [200 LOC]
- SQL: Hive on Spark (Shark) [in progress]
All are based on coarse-grained operations.
Enables apps to efficiently intermix these models.

Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications.
They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery.
Try it out at www.spark-project.org