Big Data Processing with Google's MapReduce Alexandru Costan 1
Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2
Motivation Big Data @Google: 20+ billion web pages x 20KB = 400+ TB One computer can read 30-35 MB/sec from disk 4 months to read the web 1,000 hard drives just to store the web Even more time and hard drives to do something with the data: Data processing Data analytics 3
Solution Spread the work over many machines Good news: easy parallelization Reading the web with 1,000 machines takes less than 3 hours Bad news: programming work Communication and coordination Debugging Fault tolerance Management and monitoring Optimization Worse news: repeat this for every problem 4
The size is always increasing Every Google service sees continuous growth in computational needs More queries More users, happier users More data Bigger web, mailbox, blog, etc. Better results Find the right information, and find it faster 5
Typical computer @Google Multicore machine 1-2 TB of disk 4GB-16GB of RAM Typical machine runs: Google File System (GFS) Scheduler daemon for starting user tasks One or many user tasks Tens of thousands of such machines Problem: What programming model to use as a basis for scalable parallel processing? 6
What is needed? A simple programming model that applies to many data-intensive computing problems Approach: hide messy details in a runtime library: Automatic parallelization Load balancing Network and disk transfer optimization Handling of machine failures Robustness Improvements to core library benefit all users of library 7
Such a model is MapReduce Typical problem solved by MapReduce: Read a lot of data Map: extract something interesting from each record Shuffle and Sort Reduce: aggregate, summarize, filter or transform Write the results Outline stays the same, map and reduce change to fit the problem 8
MapReduce at a glance 9
More specifically It is inspired by the Map and Reduce functions (i.e., it borrows from functional programming) Users implement the interface of two primary functions: map(k, v) -> <k', v'>* and reduce(k', <v'>*) -> <k', v''>* All v' with same k' are reduced together, and processed in v' order 10
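This interface can be made concrete with a minimal, single-machine sketch in Python (the helper name run_mapreduce is an assumption for illustration, not Google's C++ API): map emits (k', v') pairs, a shuffle step groups the values by k', and reduce folds each value list into a result.

    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn):
        # Map: apply map_fn to every (k, v) input record; it emits (k', v') pairs.
        intermediate = defaultdict(list)
        for k, v in records:
            for k2, v2 in map_fn(k, v):
                intermediate[k2].append(v2)   # shuffle: group values by intermediate key
        # Reduce: fold each key's value list into a final value.
        return {k2: reduce_fn(k2, values) for k2, values in sorted(intermediate.items())}

The later examples in this deck are sketched against this same toy helper.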
Example 1: word count 11
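A possible word-count job written against the run_mapreduce sketch above; the (filename, line) record format and the sample data are assumptions for illustration.

    def wc_map(filename, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def wc_reduce(word, counts):
        # Sum all the 1s emitted for this word.
        return sum(counts)

    records = [("doc.txt", "the quick fox"), ("doc.txt", "the lazy dog")]
    print(run_mapreduce(records, wc_map, wc_reduce))
    # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}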
Example 2: word length count 12
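The same skeleton fits the word-length-count example: only what the map function emits changes (a sketch, reusing the assumed run_mapreduce helper from above).

    def wl_map(filename, line):
        # Emit (word length, 1) for every word in the line.
        for word in line.split():
            yield len(word), 1

    def wl_reduce(length, counts):
        # Count how many words have this length.
        return sum(counts)

    records = [("doc.txt", "the quick fox"), ("doc.txt", "the lazy dog")]
    print(run_mapreduce(records, wl_map, wl_reduce))
    # {3: 4, 4: 1, 5: 1}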
Zoom on the Map phase [diagram: Input -> Map tasks -> Shuffle -> Reduce tasks -> Output] map(k, v) -> <k', v'>* Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line) 15
Combiner For certain types of reduce functions (commutative and associative: SUM, COUNT, MAX, MIN...), one can decrease the communication cost by running the reduce function within the mappers. Example, word count: Without Combiner: <docid, {list of words}> => c records <word, 1> With Combiner: <docid, {list of words}> => <word, c> where c is the number of times the word appears in the mapper's input. 16
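A sketch of one way to realize this idea for word count, via in-mapper combining (reusing the assumed helpers above): because addition is commutative and associative, the mapper can pre-sum its own counts and emit a single (word, c) pair per distinct word, while the reducer stays unchanged.

    from collections import Counter

    def wc_map_combined(filename, line):
        # Pre-aggregate inside the mapper: one (word, c) pair per distinct word
        # instead of c separate (word, 1) pairs.
        for word, c in Counter(line.split()).items():
            yield word, c

    # The reducer is unchanged: summing partial counts gives the same totals.
    records = [("doc.txt", "the quick fox"), ("doc.txt", "the lazy dog")]
    print(run_mapreduce(records, wc_map_combined, wc_reduce))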
Zoom on the Shuffle phase [diagram: Input -> Map tasks -> Shuffle -> Reduce tasks -> Output] After the map phase is over, all the intermediate values for a given output key are combined together into a list 17
Zoom on the Reduce phase [diagram: Input -> Map tasks -> Shuffle -> Reduce tasks -> Output] reduce(k', <v'>*) -> <k', v''>* reduce() combines those intermediate values into one or more final values per key (usually only one) 18
System architecture One master, many workers Master partitions the input file into M splits Master assigns workers (= servers) to the M map tasks, keeps track of their progress Workers write their output to local disk, partitioned into R regions Master assigns workers to the R reduce tasks Reduce workers read regions from the map workers' local disks Often: 1 split / chunk = 64 MB, M = 200,000; R = 4,000; workers = 2,000 19
Architectural overview: Google MapReduce [figure: master coordinating map and reduce workers] 20
Scheduling - Map Master assigns each map task to a free worker: Considers locality of data to worker when assigning the task Worker reads the task input (often from local disk) Worker applies the map function to each record in the split / task Worker produces R local files / partitions containing intermediate k/v pairs: Using a partition function, e.g., hash(key) mod R 21
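A sketch of this partitioning step, reusing the assumed wc_map from the word-count example: every intermediate pair is routed to one of R local partitions by hash(key) mod R, so all pairs with the same key end up in the same partition.

    R = 4  # number of reduce tasks (illustrative value)

    def partition(key, r=R):
        # All pairs sharing a key land in the same reduce partition.
        # (A real system uses a deterministic hash; Python's hash() of str is
        # randomized per process, which is fine for a local sketch.)
        return hash(key) % r

    partitions = [[] for _ in range(R)]   # stands in for the R local files
    for k2, v2 in wc_map("doc.txt", "the quick fox jumps over the lazy dog"):
        partitions[partition(k2)].append((k2, v2))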
Scheduling - Reduce Master assigns each reduce task to a free worker The ith reduce worker reads the ith partition output by each map using remote procedure calls Data is sorted based on the keys so that all occurrences of the same key are close to each other. Reducer iterates over the sorted data and passes all records with the same key to the user-defined reduce function. 22
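A matching single-process sketch of the reduce side (reusing the partitions and wc_reduce defined above): sorting brings all occurrences of a key next to each other, and the sorted run is then walked so that each key's values reach the user's reduce function exactly once.

    from itertools import groupby
    from operator import itemgetter

    def run_reduce_task(pairs, reduce_fn):
        # Sorting groups all occurrences of a key together;
        # groupby then yields one (key, values) run per distinct key.
        pairs.sort(key=itemgetter(0))
        return [(k, reduce_fn(k, [v for _, v in group]))
                for k, group in groupby(pairs, key=itemgetter(0))]

    print(run_reduce_task(partitions[0], wc_reduce))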
Features Exploit parallelization: Tasks are executed in parallel Fault tolerance Re-execute only the tasks on failed machines Exploit data locality Co-locate data and computation: avoid network bottleneck 23
Parallelism map() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently Bottleneck: the reduce phase can't start until the map phase is completely finished. 24
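A toy single-machine illustration of that structure (reusing the assumed wc_map, records and run_reduce_task sketches from above; a real cluster spreads the tasks over many workers): map tasks run in parallel, and the reduce step only starts once the pool has finished, i.e., after the barrier.

    from multiprocessing import Pool

    def map_task(split):
        # One map task: run wc_map over every record in its input split.
        return [pair for k, v in split for pair in wc_map(k, v)]

    if __name__ == "__main__":
        splits = [records[:1], records[1:]]           # M = 2 toy input splits
        with Pool(processes=2) as pool:
            map_outputs = pool.map(map_task, splits)  # map tasks execute in parallel
        # Implicit barrier: execution only reaches this point once every map task is done.
        all_pairs = [pair for out in map_outputs for pair in out]
        print(run_reduce_task(all_pairs, wc_reduce))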
Fault tolerance Master detects worker failures Master pings workers periodically If down then reassigns the task to another worker Re-executes completed & in-progress map() tasks Re-executes in-progress reduce() tasks Master notices particular input key/values that cause crashes in map(), and skips those values on re-execution. 25
Fault tolerance Backup tasks: Straggler = a machine that takes an unusually long time to complete one of the last tasks. Reasons: Bad disk forces frequent correctable errors (30 MB/s down to 1 MB/s) The cluster scheduler has scheduled other tasks on that machine Stragglers are a main reason for slowdown Solution: pre-emptive backup execution of the last few remaining in-progress tasks 26
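A rough, hypothetical single-machine sketch of the backup-task idea (all names and timings invented for illustration): launch a duplicate copy of a still-running task and take whichever copy finishes first, discarding the other.

    import concurrent.futures, random, time

    def task(copy):
        # Simulates one of the last in-progress tasks; a copy may land on a straggler.
        time.sleep(random.choice([0.1, 2.0]))
        return f"task output (from {copy} copy)"

    with concurrent.futures.ThreadPoolExecutor() as executor:
        primary = executor.submit(task, "primary")
        backup = executor.submit(task, "backup")      # speculative backup execution
        done, _ = concurrent.futures.wait(
            [primary, backup], return_when=concurrent.futures.FIRST_COMPLETED)
        print(next(iter(done)).result())              # whichever copy finishes first wins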
Widely used at Google distributed grep distributed sort term-vector per host document clustering machine learning web access log stats web link-graph reversal inverted index construction statistical machine translation 27
Many implementations 28
MapReduce limitations Not efficient for real-time processing Very limited queries: Difficult to write more complex tasks Need multiple map-reduce operations Solutions: declarative query languages No support for iterative processing Barrier between Map and Reduce 29
MapReduce extensions Supporting iterative processing Supporting pipeline / reduce intensive workloads 30
Supporting iterative processing MapReduce can't express recursion/iteration Lots of interesting programs need loops: graph algorithms, clustering, machine learning, recursive queries Dominant solution: use a driver program outside of MapReduce Hypothesis: making MapReduce loop-aware affords optimizations and scalable implementations of recursive languages 31
Supporting iterative processing Cache the invariant data in the first iteration, then reuse it in later iterations Cache the reducer outputs: makes checking for a fixpoint more efficient, without an extra MapReduce job Twister, HaLoop, iMapReduce 32
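A sketch of the "driver program" pattern mentioned above (names assumed, reusing the toy run_mapreduce helper): the loop and the fixpoint test live outside the framework, rerunning a full job on the previous output each iteration; loop-aware systems such as HaLoop move this loop, and the caching of invariant data, into the framework itself.

    def iterate_until_fixpoint(initial_state, map_fn, reduce_fn, max_iters=10):
        # Driver program: each iteration is a complete MapReduce job over the
        # previous iteration's output; the loop lives outside the framework.
        state = dict(initial_state)
        for _ in range(max_iters):
            new_state = run_mapreduce(state.items(), map_fn, reduce_fn)
            if new_state == state:    # fixpoint reached, stop iterating
                return new_state
            state = new_state
        return state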
Pipeline MapReduce The reducers can begin their processing of the data as soon as it is produced by the mappers MapReduce jobs run continuously, accepting new data as it arrives and analyzing it immediately: continuous queries, event monitoring and stream processing Pipelining delivers data to downstream operators more promptly, which can increase parallelism, improve utilization and reduce response time 33
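A very rough single-process sketch of the pipelining idea (not the actual MapReduce Online implementation; names and data are assumptions): the map output is a generator consumed as it is produced, and the reducer keeps a running aggregate, so a partial answer is available at any time instead of only after a final barrier.

    from collections import Counter

    def streaming_map(lines):
        # Map output is produced lazily, record by record, instead of all at once.
        for line in lines:
            for word in line.split():
                yield word, 1

    running = Counter()
    for word, one in streaming_map(["the quick fox", "the lazy dog"]):
        running[word] += one   # the "reducer" consumes pairs as they are produced
        # 'running' is a usable snapshot of the answer at any point in the stream
    print(running)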
An example of a reduction tree [figure: map outputs combined by intermediate reduce nodes into a final result] 34
MapReduce summary Hides scheduling and parallelization details Simple to program: only Map and Reduce need to be written Efficient for batch processing, not efficient for real-time Several extensions for iterative and pipeline processing Additional reading: The Family of MapReduce and Large-Scale Data Processing Systems 35