COMP5426 Parallel and Distributed Computing. MapReduce

1 COMP5426 Parallel and Distributed Computing MapReduce

2 Data-Intensive Computing Data-Intensive Typically store data at datacenters Use compute nodes nearby Compute nodes run computation services In data-intensive computing, the focus is on the data: problem areas include Storage Communication bottleneck Moving tasks to data (rather than vice-versa) Security Availability of Data Scalability

3 What is MapReduce? MapReduce is a distributed/parallel computing framework introduced by Google to support computing on large data sets on clusters of computers. Originally used at Google, now widely used as a more general platform for data-intensive computing 3

4 What is MapReduce? The framework is inspired by map and reduce functions commonly used in functional programming (although their purpose in the MapReduce framework is not the same as their original forms). User implements map() and reduce() functions Runtime library takes care of EVERYTHING else 4

5 What is MapReduce? A simple programming model that applies to certain large-scale data-intensive computing problems Hide messy details in MapReduce runtime library: automatic parallelization load balancing network and disk transfer optimization handling of machine failures Robustness Improvements to core library benefit all users of library! 5

6 Motivation Large Scale Data Processing Want to process lots of data ( > 1 TB) Want to parallelize across hundreds/thousands of CPUs How to parallelize How to distribute How to handle failures Want to make this easy 6

7 Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize, filter, or transform Write the results Outline stays the same, but map and reduce change to fit the problem 7

8 Programming Model Map Takes an input pair and produces a set of intermediate key/value pairs, e.g., Map: (key1, value1) -> list(key2, value2) The MapReduce library groups together all intermediate values associated with the same intermediate key Reduce Accepts an intermediate key and the list of values for that key and merges them, e.g., Reduce: (key2, list(value2)) -> value3

9 Example: Counting Words Counting words in a large set of documents: map() Input <filename, file_text> Parses the file and emits <word, count> pairs, e.g. <hello, 1> reduce() Sums all values for the same key and emits <word, TotalCount>, e.g. <hello, (list of counts)> => <hello, 17>

10 Example: Use of MapReduce
map(string key, string value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, "1");
reduce(string key, iterator values)
    // key: word
    // values: list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(AsString(result));
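The word-count pseudo-code can be tried out on a single machine. The sketch below is only an illustration: `run_mapreduce` is a hypothetical in-memory stand-in for the runtime library (no distribution, no fault tolerance), and the two sample documents are made up.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)          # group values by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Hypothetical input: (document name, document contents)
docs = [("a.txt", "hello world hello"), ("b.txt", "hello mapreduce")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

Only `map_fn` and `reduce_fn` are specific to word counting; the driver stays the same for other problems, which is the point of the programming model.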

11 Actual Source Code The example is written in pseudo-code Actual implementation is in C++, using a MapReduce library True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.) 11

12 Applications Structure of the Web: Input is (URL, contents) Scan through the document's contents looking for links to other URLs If map outputs (URL, linked-to URL), you get a simple representation of the WWW link graph If map outputs (linked-to URL, URL), you get the reverse link graph: what web pages link to me? If map outputs (linked-to URL, anchor text), you get: how do other web pages characterize me?
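As a rough illustration of the reverse link graph, the sketch below emits (linked-to URL, URL) from the map step and collects the sources in the reduce step. The regular expression, the function names, and the sample pages are assumptions made for this example; real link extraction would need a proper HTML parser.

```python
import re
from collections import defaultdict

# Assumption: links appear as href="..." attributes.
HREF = re.compile(r'href="([^"]+)"')

def map_reverse_links(url, contents):
    """Emit (linked-to URL, URL) for every link found in the page."""
    for target in HREF.findall(contents):
        yield target, url

def reduce_collect(target, sources):
    """Collect the distinct pages that link to `target`."""
    return target, sorted(set(sources))

# Hypothetical input: (URL, contents)
pages = [
    ("a.example", '<a href="b.example">B</a> <a href="c.example">C</a>'),
    ("b.example", '<a href="c.example">C</a>'),
]

grouped = defaultdict(list)                 # stands in for the shuffle step
for url, contents in pages:
    for target, source in map_reverse_links(url, contents):
        grouped[target].append(source)

reverse_graph = dict(reduce_collect(t, s) for t, s in grouped.items())
print(reverse_graph)
```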

13 Applications Google used MapReduce for Page indexing pipeline: What are all the pages that match this query? PageRank: What are the best pages that match this query? and many others Greatly simplified large-scale computations at Google

14 How MapReduce Works User's to-do list: Write map() and reduce() functions Indicate: input/output files, M (number of map tasks), R (number of reduce tasks), W (number of machines) Submit the job This requires no knowledge of parallel & distributed systems! What about everything else?

15 The Infrastructure Large clusters of commodity PCs and networking hardware Clusters consist of hundreds or thousands of machines (failures are common) GFS (Google File System) Distributed file system Provides replication of the data

16 Parallelism map() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently Synchronization is required between the two phases: reduce cannot start until map has finished
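The structure of the two parallel phases, with a synchronization point in between, can be sketched with a thread pool. This is only a sketch of the control flow: the function names and input chunks are invented, and threads in CPython do not give real CPU parallelism for this kind of work.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Map task: count the words in one input chunk."""
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word] += 1
    return list(counts.items())

def sum_counts(item):
    """Reduce task: add up the partial counts for one key."""
    word, counts = item
    return word, sum(counts)

chunks = ["a b a", "b c", "a c c"]           # hypothetical input splits

with ThreadPoolExecutor() as pool:
    # Map phase: one task per chunk, run concurrently.
    # list() acts as the barrier: it returns only when every map task is done.
    map_outputs = list(pool.map(count_words, chunks))

    # Group intermediate values by key (the shuffle).
    grouped = defaultdict(list)
    for output in map_outputs:
        for word, n in output:
            grouped[word].append(n)

    # Reduce phase: one task per key, run concurrently.
    totals = dict(pool.map(sum_counts, grouped.items()))

print(totals)
```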

17 Data Distribution Input and final output are stored on a distributed file system Input files are split into M pieces on the distributed file system, typically ~16 to 128 MB blocks Scheduler tries to schedule map tasks close to the physical storage location of the input data Intermediate results are stored on the local file system of map and reduce workers Output can be input to another MapReduce task

18 Execution Workers are assigned work by the master The master is started by the MapReduce Framework

19 Execution Workers assigned map tasks read the input, parse it and invoke the user's Map() method.

20 Execution Intermediate key/value pairs are buffered in memory Periodically, buffered data is written to local disk, partitioned into R files by a pseudo-random partitioning function, e.g., hash(k) mod R
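A minimal sketch of the partitioning step, assuming string keys and using CRC32 as the hash. The choice of CRC32 is an assumption for this example; any hash that gives the same value on every machine would do. Python's built-in `hash()` on strings is randomized per process by default, so it is not suitable here.

```python
import zlib

def partition(key: str, R: int) -> int:
    """Return the reduce partition (0..R-1) for an intermediate key."""
    return zlib.crc32(key.encode("utf-8")) % R

def bucket_pairs(pairs, R):
    """Split intermediate (key, value) pairs into R buckets, one per reducer."""
    buckets = [[] for _ in range(R)]
    for key, value in pairs:
        buckets[partition(key, R)].append((key, value))
    return buckets

pairs = [("hello", 1), ("world", 1), ("hello", 1), ("mapreduce", 1)]
buckets = bucket_pairs(pairs, R=3)
```

Because the partition depends only on the key, every pair with the same key lands in the same bucket on every map worker, so a single reducer sees all values for that key.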

21 Execution Locations are passed back to the master who forwards these locations to workers executing the reduce function.

22 Execution Reduce runs after all mappers are done Workers executing Reduce are notified by the master about location of intermediate data

23 Execution Reduce workers use remote procedure calls to read the data from the local disks of map workers A reduce worker then sorts all intermediate data by intermediate key

24 Execution Reduce worker iterates over the sorted intermediate data and for each key encountered it passes the key and the corresponding set of intermediate values to the Reduce function
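The sort-then-iterate behaviour of a reduce worker can be sketched with `itertools.groupby`, which groups adjacent items and therefore needs sorted input. The function name and the sample data are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

def reduce_partition(pairs, reduce_fn):
    """Sort fetched pairs by key, then call reduce_fn once per distinct key."""
    ordered = sorted(pairs, key=itemgetter(0))      # sort by intermediate key
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]      # all values for this key
        output.append(reduce_fn(key, values))
    return output

# Hypothetical pairs fetched from several map workers, in arbitrary order.
fetched = [("world", 1), ("hello", 1), ("hello", 1), ("world", 1), ("hello", 1)]
result = reduce_partition(fetched, lambda key, values: (key, sum(values)))
print(result)
```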

25 Execution The output of the Reduce function is appended to a final output file

26 Execution

27 Parallel Execution

28 Coordination Master data structures Task status: (idle, in-progress, completed) Idle tasks get scheduled as workers become available When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer Master pushes this info to reducers Master pings workers periodically to detect failures

29 Observations No reduce can begin until map is complete Tasks scheduled based on location of data If map worker fails any time before reduce finishes, task must be completely rerun Master must communicate locations of intermediate files MapReduce library does most of the hard work for us! 29

30 Fault Tolerance Worker failure: Detect failure via periodic heartbeats Re-execute completed and in-progress map tasks Re-execute in-progress reduce tasks Task completion committed through master Master failure: State is checkpointed to replicated file system New master recovers & continues Very Robust: lost 1600 of 1800 machines once, but finished fine 30
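The re-execution rule for worker failures can be written down as a small sketch. The `Task` record and `handle_worker_failure` are hypothetical names, not part of any real library. Completed map tasks are reset because their output sits on the failed machine's local disk; completed reduce tasks are left alone because their output is already in the distributed file system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # "idle", "in-progress", or "completed"
    worker: Optional[str] = None   # worker the task was assigned to

def handle_worker_failure(tasks, failed_worker):
    """Reset every task that must be re-executed after a worker fails."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in-progress" or lost_map_output:
            task.state, task.worker = "idle", None   # eligible for rescheduling
            rescheduled.append(task)
    return rescheduled

tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in-progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in-progress", "w1"),
    Task("map", "completed", "w2"),
]
reset = handle_worker_failure(tasks, "w1")
```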

31 Locality The MapReduce master takes the location information of input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data, or failing that, near a replica of that task's input data The goal is to read most input data locally and thus reduce the consumption of network bandwidth
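A simplified sketch of locality-aware assignment: prefer an idle worker that already holds a replica of the split, and fall back to any idle worker otherwise. The function and host names are made up, and the "nearby" case (for example, same rack) is left out to keep the example short.

```python
def pick_worker(replica_hosts, idle_workers):
    """Return (worker, is_local) for a map task, or (None, False) if none is idle."""
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker, True            # input can be read from local disk
    if idle_workers:
        return idle_workers[0], False      # input must be read over the network
    return None, False

# Hypothetical cluster state: the split is replicated on three hosts.
replicas = {"host2", "host5", "host7"}
local_choice = pick_worker(replicas, ["host1", "host5", "host9"])
remote_choice = pick_worker(replicas, ["host1", "host9"])
```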

32 Task Granularity M and R should be much larger than the number of available machines. Dynamic load balancing. Speeds up recovery in case of failures. R determines the number of output files Often constrained by users.

33 Backup Tasks Some stragglers do not perform optimally: Other processes demanding resources Bad disks (correctable errors) slow I/O speeds from 30 MB/s to 1 MB/s CPU cache disabled?! Near the end of a phase, schedule redundant execution of in-progress tasks First to complete wins Slightly increases the computational resources needed Does not increase running time, but has the potential to improve it significantly.

34 Advantages & Disadvantages Transparent task/data distribution Fault tolerant Load balancing Restricted programming model Many problems cannot be expressed easily in this model Not efficient for iterative computational jobs

35 Spark Programming Framework In-memory data sharing - Allow applications to keep working sets in memory for efficient reuse Iterative and interactive computing Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability Simple programming interface Support a wide range of applications
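The "first to complete wins" idea can be sketched with two concurrent attempts at the same task. Everything here is an assumption for illustration: the delays simulate a straggler and a healthy backup, and a real system would also discard or kill the losing attempt instead of letting it finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task(attempt_name, delay):
    """Simulated task attempt: sleep for `delay` seconds, then report who ran."""
    time.sleep(delay)
    return attempt_name

def run_with_backup(primary_delay, backup_delay):
    """Run a primary and a backup attempt; return the result of the first to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        attempts = {
            pool.submit(run_task, "primary", primary_delay),
            pool.submit(run_task, "backup", backup_delay),
        }
        done, _ = wait(attempts, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()

# The primary is a straggler; the backup runs at normal speed.
winner = run_with_backup(primary_delay=0.3, backup_delay=0.01)
print(winner)
```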

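36 MapReduce [Figure: an iterative job performs an HDFS read and an HDFS write around each iteration (iter. 1, ...); interactive queries (query 1, query 2, query 3, ...) each perform an HDFS read of the input to produce result 1, result 2, result 3, ...] Slow due to replication and disk I/O, but necessary for fault tolerance

37 Goal: In-Memory Data Sharing [Figure: the input goes through one-time processing; iterations (iter. 1, ...) and queries (query 1, query 2, ...) then share the data in memory] Faster than network/disk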

38 Resilient Distributed Datasets (RDDs) Distributed collections of objects that can be cached in memory across cluster nodes Manipulated through various parallel operators Automatically rebuilt on failure Representation: set of partitions; set of dependencies (lineage); functions to compute the RDD from its parent RDDs; metadata on partitioning scheme & data placement
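To make the representation concrete, here is a toy, single-process sketch of an RDD-like object. `ToyRDD` is a hypothetical class written for this example and is not Spark's API. It keeps a number of partitions, a parent dependency, and a compute function, and it rebuilds a lost cached partition by re-running that function against the parent.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions + lineage + compute function."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self.compute = compute     # function: partition index -> list of records
        self.parent = parent       # dependency, kept as lineage information
        self.cache = {}            # partition index -> cached records

    @classmethod
    def from_partitions(cls, partitions):
        """Create a base dataset from already-partitioned data."""
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn):
        """Return a new ToyRDD whose partitions are derived from this one."""
        parent = self
        return ToyRDD(parent.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)],
                      parent=parent)

    def partition(self, i):
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i not in self.cache:
            self.cache[i] = self.compute(i)
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

base = ToyRDD.from_partitions([[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)
first = squares.collect()
squares.cache.pop(1)               # simulate losing one cached partition
second = squares.collect()         # partition 1 is rebuilt from its parent
```

Only the lost partition is recomputed; the cached partition 0 is reused as is.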

39 YARN Hadoop 2.x YARN (Yet Another Resource Negotiator) was designed to address the limitations of the original MapReduce. The fundamental idea is to separate resource management functions from programming frameworks. This separation gives great flexibility: other high-level programming frameworks run on top of YARN as applications, and MapReduce is just one of these applications.

40 YARN High Level Architecture

41 YARN High Level Architecture RM (Resource Manager): Single, centralized daemon for scheduling containers Monitors nodes and applications NM (Node Manager): Daemon running on each worker node in the cluster Launches, monitors, and controls containers AM (Application Master): Provides scheduling, monitoring, and control for an application instance RM launches an AM for each application submitted to the cluster AM requests containers via RM, launches containers via NM Containers: Unit of allocation and control for YARN AM and application-specific tasks run within containers

42 Summary MapReduce: a framework for data-intensive distributed computing Distributed programs are easy to write and understand Provides fault tolerance Program execution can be easily monitored Spark: a new programming framework The key idea is the RDD Store data in memory for efficient iterative computing YARN: a new resource management architecture Addresses the limitations of the original MapReduce Decouples programming models from resource management Delegates many scheduling functions to per-application components Currently, a lot of research and development is still going on

43 References Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. Ralf Lämmel, Google's MapReduce Programming Model Revisited.

44 References M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012 Apache Hadoop YARN: Yet Another Resource Negotiator, Proceedings of ACM SoCC