Large-Scale Distributed Systems. From GWS to MapReduce: Google's Cloud Technology in the Early Days. Part II: MapReduce in a Datacenter. COMP6511A Spring 2014, HKUST. Lin Gu, lingu@ieee.org
MapReduce/Hadoop. Around 2004, Google invented MapReduce to parallelize computation over large data sets; it has been a key component in Google's technology foundation. Around 2008, Yahoo! developed the open-source variant of MapReduce named Hadoop. After 2008, MapReduce/Hadoop (or variants) became a key technology component in cloud computing. In 2010, the U.S. granted Google a patent on MapReduce.
Data-Intensive Computation: The MapReduce Approach. MapReduce: parallel computing for Web-scale data. Map: a higher-order function applied to all elements in a list; the result is a new list. [Figure: each element of '(1 2 3 4 5) is squared, yielding '(1 4 9 16 25)]
    (map (lambda (x) (* x x)) '(1 2 3 4 5))  =>  '(1 4 9 16 25)
Data-Intensive Computation: The MapReduce Approach. Reduce is also a higher-order function, like fold: it aggregates the elements of a list. The accumulator is set to an initial value; the function is applied to a list element and the accumulator, and the result is stored back in the accumulator; this repeats for every item in the list, and the result is the final accumulator value. [Figure: folding + over '(1 2 3 4 5) with initial value 0 produces the running sums 1, 3, 6, 10, 15]
    (fold + 0 '(1 2 3 4 5))  =>  15
    (fold * 1 '(1 2 3 4 5))  =>  120
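The same two examples, written as a minimal runnable sketch in Python (not from the original slides; Python stands in for the Scheme above):

    # Python equivalents of the map and fold examples above.
    from functools import reduce  # Python's fold

    squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))
    print(squares)                                    # [1, 4, 9, 16, 25]

    total   = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0)
    product = reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1)
    print(total, product)                             # 15 120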
The MapReduce Approach: massive parallel processing made simple. MapReduce programming, example: word count. Map: parse a document and generate <word, 1> pairs. Reduce: receive all pairs for a specific word, and count (sum) them.
    Map:  // D is a document
      for each word w in D:
        output <w, 1>
    Reduce:  // for key w
      count = 0
      for each input item:
        count = count + 1
      output <w, count>
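Putting the two pieces together, a single-machine Python simulation of this word-count job (a sketch for illustration only; this is not the Hadoop API):

    from collections import defaultdict

    def map_phase(document):
        # Map: parse a document and emit <word, 1> pairs.
        for word in document.split():
            yield (word, 1)

    def reduce_phase(pairs):
        # Shuffle/group by key, then Reduce: sum the counts per word.
        groups = defaultdict(list)
        for word, one in pairs:
            groups[word].append(one)
        return {word: sum(ones) for word, ones in groups.items()}

    docs = ["the quick brown fox", "the lazy dog"]
    pairs = [kv for d in docs for kv in map_phase(d)]
    print(reduce_phase(pairs))   # {'the': 2, 'quick': 1, ...}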
Workflow. 1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 to 64 megabytes (MB) each. It then starts up many copies of the program on a cluster of machines. 2. One of the copies of the program is the master; the rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
Workflow. 3. A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value (KV) pairs produced by the Map function are buffered in memory. 4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.
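The partitioning function in step 4 is typically just a hash of the intermediate key modulo R. A minimal sketch (the choice of crc32 here is my own, for a hash that is stable across processes; Hadoop's HashPartitioner uses the key's hashCode the same way):

    import zlib

    def partition(key, R):
        # Deterministic hash so the same key always lands in the
        # same one of the R reduce partitions on every map worker.
        return zlib.crc32(key.encode()) % R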
Workflow. 5. A reduce worker processes one partition of the data. When a reduce worker r is notified by the master about the locations of intermediate data, r uses RPCs to read the data in its partition from the local disks of the map workers. After all intermediate data are read, r sorts and groups them by the intermediate keys. 6. Each reduce worker iterates over the sorted intermediate KV pairs in its partition and, for each unique intermediate key encountered, passes the key and the associated set of values to the Reduce function. The output of the Reduce function is appended to a temporary output file on GFS. 7. A completed reduce task renames its temporary output file to the final output file of its partition. When all tasks have completed, MapReduce returns to the user code.
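Steps 5 and 6 amount to a sort followed by a group-by on the intermediate key. A sketch of what a reduce worker does with its fetched partition, assuming the pairs fit in memory:

    from itertools import groupby
    from operator import itemgetter

    def run_reduce(partition_pairs, reduce_fn):
        # Sort by intermediate key, then hand each key and its value
        # list to the user-defined Reduce function (step 6).
        partition_pairs.sort(key=itemgetter(0))
        for key, group in groupby(partition_pairs, key=itemgetter(0)):
            yield key, reduce_fn(key, [v for _, v in group])

    # Word count's Reduce is just a sum over the emitted 1s:
    out = dict(run_reduce([("a", 1), ("b", 1), ("a", 1)],
                          lambda k, vs: sum(vs)))
    print(out)   # {'a': 2, 'b': 1}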
Programming. How do we write a MapReduce program to generate inverted indices? To sort? How do we express more sophisticated logic? What happens if some workers (slaves), or the master, fail?
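For the inverted-index question, the model fits directly: Map emits <word, documentID> pairs and Reduce assembles the posting list for each word. A hypothetical sketch (the function names are my own):

    def map_invert(doc_id, text):
        # Map: emit <word, doc_id> for every word in the document.
        for word in text.split():
            yield (word, doc_id)

    def reduce_invert(word, doc_ids):
        # Reduce: the sorted, de-duplicated posting list for this word.
        return (word, sorted(set(doc_ids)))

Sorting, likewise, largely falls out of the framework itself: intermediate keys arrive at each reducer in sorted order, so a Map that emits the desired sort key together with an identity Reduce produces sorted output partitions.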
MapReduce Limitations. MapReduce provides an easy-to-use framework for parallel programming, but is it the most efficient solution for program execution in datacenters? MapReduce has its discontents. DeWitt and Stonebraker ("MapReduce: a major step backwards"): MapReduce is far less sophisticated and efficient than parallel query processing. But MapReduce is a parallel processing framework, not a database system or a query language, and it is possible to use MapReduce to implement some parallel query processing functions. What are the real limitations? It is inefficient for general programming (and not designed for that); it is hard to handle data with complex dependences, frequent updates, etc.; it incurs high overhead and bursty I/O, and handles long streaming data poorly; and it offers limited opportunity for optimization.
Problems of MapReduce/Hadoop: slow. It cannot support low-latency, interactive, or real-time programs. It is very slow if the algorithm processes data in a graph model, contains non-trivial dependences among operations, walks through multiple iterations, or follows a complex data/control flow. It is not programmable if the algorithm requires recursive invocation of the Map and Reduce steps: that requires an external "glue" language and takes unpredictable execution time.
Computing Capability. [Bar chart: compilation time (sec) of gcc (on one node), mrcc/hadoop, and mrcc/mrlite when building the Linux kernel, ImageMagick, and the Xen tools.] Z. Ma and L. Gu. The Limitation of MapReduce: A Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010. Using Hadoop, parallel compilation is even slower than sequential compilation!
More Solutions Coming: Dryad/DryadLINQ, Megastore/Spanner, Pregel, OpenCL, RAMCloud. The proliferation of solutions in this area, however, indicates that there has not yet been a solid, general-purpose technology for cloud systems. Challenges remain in constructing sophisticated big-data processing capabilities, wide-area computation, and very large database systems.
Spark/RDD. RDD: Resilient Distributed Datasets. Computation happens in memory; operators transform one dataset into another. 20 times faster than Hadoop on some iterative workloads. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proc. of NSDI '12. RDD is not a general-purpose instrument for big-data computation, and empirical data on larger-scale workloads is still needed to understand its scalability; but in-memory computation will be increasingly important in big-data computation.
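For comparison with the word-count pseudocode earlier, the same job in Spark's RDD model (a PySpark sketch; the input path is hypothetical). Caching is what pays off on iterative workloads: the dataset stays in memory across passes instead of being re-read from disk.

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("hdfs:///input/docs")       # hypothetical path
                .flatMap(lambda line: line.split())   # Map-like transform
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))     # Reduce-like transform
    counts.cache()        # keep the RDD in memory for later reuse
    print(counts.take(5))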
Inside MapReduce-Style Computation. Network activities under a MapReduce/Hadoop workload. Hadoop: the open-source implementation of MapReduce. Processing data with 3 servers (20 cores); 116.8 GB of input data. Network activities captured with Xen virtual machines.
MapReduce workflow: initial data are split into 64 MB blocks; each block is computed and the results stored locally; the master is informed of the result locations; the R reducers retrieve data from the mappers; the final output is written. The workflow is communication-intensive.
Inside MapReduce. Packet reception under a MapReduce/Hadoop workload: large data volume and bursty network traffic. This burstiness is widely observed in MapReduce workloads. [Figure: packet reception on a slave server]
Inside MapReduce. [Figure: packet reception on the master server]
Inside MapReduce. [Figure: packet transmission on the master server]
Datacenter Networking. Common datacenter topology: Internet -> Core (Layer-3 routers) -> Aggregation (Layer-2/3 switches) -> Access (Layer-2 switches) -> Servers.