Cloud Computing, Lectures 10 and 11
MapReduce: A System Perspective
2014-2015
MapReduce in More Detail
Master (i)
Execution is controlled by the master process:
- Input data are split into 64MB blocks.
- Tasks are assigned to the worker processes dynamically.
- In the reduce phase there are as many reduce tasks as partitions of the map output (but only 1 reducer process by default).
- A typical Google run: 200,000 map tasks; 4,000 reduce tasks; 2,000 workers.
The master assigns each map task (1 split) to one worker:
- The worker reads the input, ideally from its local disk, and produces R local files with <key,value> pairs.
Master (ii)
The master assigns each reduce task to a reducer:
- The reducer reads its intermediate files from all the mappers.
- The reducer sorts the <key,value> pairs by key and applies the reduce function.
- The user may specify a partitioning function to control which keys (and hence values) each reducer gets.
Fault Tolerance
Worker faults:
- Faults are detected by periodic pings.
- Idle or in-progress tasks on a failed worker are rescheduled and redone.
- Finished map tasks on a failed worker are also redone, because their output sits on that worker's local disk; finished reduce tasks are not, since their output is already in the global file system.
- Completed tasks are reported to the master.
Master failure:
- The master state is checkpointed in a distributed file system.
- When a failed master is restarted, it reads the saved state and continues from that point on.
Splits
- Input data is provided to the mappers in splits. A split is a block of input data; HDFS blocks are 64MB by default.
- In particular, Hadoop tries to create splits whose data is local to a single node, and then runs the mapper on the node that stores that split.
Input Formats
Controlled by conf.setInputFormat():
- FileInputFormat (the default): one 64MB block per split, and one split per map, for files larger than 64MB.
- CombineFileInputFormat: groups files on the same node into the same split.
- KeyValueTextInputFormat
- NLineInputFormat
Output Formats
Controlled by conf.setOutputFormat():
- TextOutputFormat (the default): one plain-text file per reducer.
- SequenceFileOutputFormat: compressed output, normally used to feed further MapReduce cycles.
- MultipleOutputFormat: allows control over the number and names of output files.
It is also possible to output to databases.
Number of Tasks
- The number of mappers is determined from the size of the input data.
- By default there is only 1 reducer process.
- Both are configurable, as sketched below:
conf.setNumMapTasks(int num);
conf.setNumReduceTasks(int num);
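A minimal configuration sketch tying these options together, using the old org.apache.hadoop.mapred API that these slides assume; the job name, paths, and format choices are illustrative, not prescribed by the slides:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class JobSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JobSetup.class);
    conf.setJobName("format-example");

    // Input/output formats (defaults: TextInputFormat / TextOutputFormat).
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    // Task counts: the map count is only a hint (the framework derives
    // it from the number of splits); the reduce count is honored exactly.
    conf.setNumMapTasks(100);
    conf.setNumReduceTasks(4);

    // KeyValueTextInputFormat yields <Text, Text> records, and the
    // default identity mapper/reducer passes them through unchanged.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}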
Partitions
- Partitioners partition the intermediate data before it is submitted to the reducers.
- By default, the partition is determined by the hashCode() of the intermediate (map output) key class.
- You can change it by reimplementing the key's hashCode() method or by replacing the partitioner with a new one.
Example of a Partitioner

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
  public void configure(JobConf job) {}
  public int getPartition(K2 key, V2 value, int numPartitions) {
    // Mask the sign bit so the result is non-negative, then
    // spread keys evenly over the reduce partitions.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
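For illustration, a hypothetical custom partitioner that routes words to reducers by their first character, so each reducer receives a contiguous alphabetical range; the class name and routing rule are assumptions, not part of the slides:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {}
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) return 0;
    // charAt() returns the Unicode code point, which is non-negative.
    return key.charAt(0) % numPartitions;
  }
}

It is registered on the job with conf.setPartitionerClass(FirstLetterPartitioner.class);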
Combiner
- Combiners reduce the amount of data at the mappers' output by aggregating intermediate results from one node.
- In the WordCount example (last lecture), we could aggregate each mapper's output before shuffling it to the reducers, as sketched below.
- By default there is no combiner.
- Combiners are implemented like a reducer class and are added to the workflow by calling conf.setCombinerClass();
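In WordCount the reduce function (summation) is associative and commutative, so the reducer class can double as the combiner: partial sums computed on each mapper's node are themselves valid reduce inputs. A sketch, assuming Map and Reduce are the WordCount classes from the last lecture:

// Partial word counts are summed on the mapper's node before shuffling.
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);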
Main Hadoop MapReduce Parameters
Hadoop MapReduce Servers
- JobTracker: a Java class; the central manager of a MapReduce job.
- TaskTracker: a Java class; the process responsible for starting local map and reduce workers. It periodically notifies the JobTracker that it is still alive.
Execution of MapReduce Programs
Worker Faults
- Situations where the workers notify the JobTracker: exceptions; abrupt termination of the JVM.
- A worker is considered failed if it spends longer than mapred.task.timeout without reporting to the JobTracker.
- Map output files live in the local file system, not HDFS: it is not efficient to replicate output data that the mappers can simply recompute.
- Failed tasks are retried up to 4 times, on different nodes.
- It is important for mappers and reducers to be able to deal with corrupted data. After the 2nd retry, they switch to skipping mode: all input records that cause errors are skipped.
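The timeout is configurable; a sketch of the corresponding hadoop-site.xml entry (the value is in milliseconds; 600000, i.e. 10 minutes, is the usual default):

<!-- Time a task may go without reporting progress before it is failed. -->
<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
</property>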
Server Faults
- If a TaskTracker fails or is too slow reporting to the JobTracker, its tasks are restarted at other nodes.
- If the JobTracker fails, all work is lost. This is Hadoop's main weakness.
Scheduling
Hadoop MapReduce scheduling has evolved to ensure greater fairness:
- Originally, Hadoop scheduled MapReduce jobs in FIFO order.
- Later, FIFO with priorities (VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW), without preemption.
- Then, the scheduler was turned into a plugin. Example: FairScheduler (Facebook):
  - Jobs are grouped into pools (by Unix user, Unix group, or labels).
  - Each pool reserves a minimum number of map and reduce slots and has a maximum number of running jobs.
  - Free slots are distributed evenly between all pools.
hadoop-site.xml

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/pools.xml</value>
</property>
pools.xml

<?xml version="1.0"?>
<allocations>
  <pool name="ads">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
  <user name="smith">
    <maxRunningJobs>1</maxRunningJobs>
  </user>
  <userMaxJobsDefault>10</userMaxJobsDefault>
</allocations>
Scheduling (2)
CapacityScheduler (Yahoo!):
- Keeps a set of queues, one per user, group, or organization.
- Fairness is guaranteed between queues as in Facebook's FairScheduler: slot reservations plus even distribution of the excess slots.
- Within each queue, scheduling is FIFO with priorities, with a minimum quantum before preemption.
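For comparison with the FairScheduler configuration on the previous slides, a sketch of enabling the CapacityScheduler; the queue name "ads" is illustrative, and the exact property names vary across Hadoop versions, so treat these as indicative:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,ads</value>
</property>
<!-- In capacity-scheduler.xml: the share of cluster slots
     reserved for the "ads" queue. -->
<property>
  <name>mapred.capacity-scheduler.queue.ads.capacity</name>
  <value>20</value>
</property>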
A MapReduce Graph Application
- Performing a calculation over a graph requires processing each node. Each node carries its own data, as well as arcs to other nodes.
- The computation must traverse the graph and run a computational step at each node.
- How can we traverse the graph using MapReduce? How can we represent the graph in MapReduce?
Dijkstra's Algorithm: Shortest Path to a Node in a Graph

function Dijkstra(Graph, source):
    for each vertex v in Graph:
        dist[v] := infinity
        previous[v] := undefined
    dist[source] := 0
    Q := the set of all nodes in Graph
    while Q is not empty:
        u := vertex in Q with smallest dist[]
        if dist[u] = infinity:
            break            /* only unreachable nodes left */
        remove u from Q      /* u is processed */
        for each neighbor v of u:
            alt := dist[u] + dist_between(u, v)
            if alt < dist[v]:
                dist[v] := alt
                previous[v] := u
    return dist[]
Breadth-First Search
Breadth-first search is an iterative search algorithm over a graph: on each step, the search frontier moves one step further away from the origin.
BFS & MapReduce Problem: There is a mismatch between BFS and MapReduce. Solution: Iterative traversal using MapReduce map some nodes, change the border and run MapReduce again. 26
BFS & MapReduce Problem: It s too expensive to send the whole graph to all of map tasks. Solution: Create a new graph representation. 27
Graph Representation
The most straightforward representation describes each node by its connections to its neighbors (an adjacency list).
Adjacency Matrix
Another classic graph representation: M[i][j] = 1 means that there is an arc from i to j.

      1  2  3  4
  1   0  1  0  1
  2   1  0  1  1
  3   0  1  0  0
  4   1  0  1  0
Direct References
- Iterating through the graph requires a list of all of its nodes.
- This solution requires that the graph be shared, and therefore synchronized!
- More complex structures are also more difficult to serialize.

class Node {
  Object data;        // node payload
  Vector<Node> arcs;  // outgoing arcs
  Node next;          // next node in the global node list
}
Adjacency Matrix: Sparse Representation
- An adjacency matrix for a large graph (e.g. the web) will be mostly zeros, and each row will be very long.
- A sparse representation is needed that stores only the non-zero elements.
Sparse Representation
1: (3, 1), (18, 1), (200, 1)
2: (6, 1), (12, 1), (80, 1), (400, 1)
3: (1, 1), (14, 1)

Each line lists a node followed by its (neighbor, arc weight) pairs.
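A sketch of parsing one such line into (neighbor, weight) pairs; the textual format is the one shown above, and error handling is omitted:

import java.util.ArrayList;
import java.util.List;

public class AdjacencyLine {
  // Parses a line such as "1: (3, 1), (18, 1), (200, 1)".
  public static List<int[]> parse(String line) {
    List<int[]> arcs = new ArrayList<>();
    String[] parts = line.split(":", 2);  // parts[0] = node id
    for (String p : parts[1].split("\\)\\s*,?\\s*")) {
      p = p.replace("(", "").trim();
      if (p.isEmpty()) continue;
      String[] nw = p.split(",");
      arcs.add(new int[] { Integer.parseInt(nw[0].trim()),    // neighbor
                           Integer.parseInt(nw[1].trim()) }); // weight
    }
    return arcs;
  }
}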
Shortest Path: Intuition
Let's define the solution by induction, assuming all distances are 1:
- DistanceTo(InitialNode) = 0
- For all nodes n directly reachable from InitialNode, DistanceTo(n) = 1.
- For all other nodes m, reachable from a set of nodes S, DistanceTo(m) = 1 + min(DistanceTo(n), n ∈ S)
From the Intuition to the Algorithm
- A map worker gets a node n as the key and (D, points) as the value: D is the distance from n to the origin; points is the list of nodes reachable from n.
- For each p ∈ points, it emits <p, (D+1, ?)>.
- The reduce task aggregates all the distances proposed for a given point p and selects the shortest one.
Which leads to...
- This MapReduce step advances the search frontier one step; to run the whole BFS, MapReduce must be fed the whole graph again on each iteration.
- Where did the list of points go? Map workers must output the arcs, <n, (?, points)>, again.
- This algorithm is less efficient than Dijkstra's, but much more scalable. A sketch of one iteration follows.
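A minimal sketch of one BFS iteration in the old mapred API. The encoding of the value as "distance|p1,p2,..." (with Integer.MAX_VALUE standing in for infinity) is an assumption made for illustration; a real implementation would define a custom Writable:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BfsStep {
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text node, Text value,
                    OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      String[] parts = value.toString().split("\\|", 2);
      int d = Integer.parseInt(parts[0]);
      out.collect(node, value);           // re-emit the arcs so they are not lost
      if (d == Integer.MAX_VALUE) return; // node not reached yet
      if (parts.length < 2 || parts[1].isEmpty()) return;
      for (String p : parts[1].split(",")) {
        // Propose a candidate distance for each neighbor: one more hop.
        out.collect(new Text(p.trim()), new Text(Integer.toString(d + 1)));
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text node, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      int best = Integer.MAX_VALUE;
      String arcs = "";
      while (values.hasNext()) {
        String[] parts = values.next().toString().split("\\|", 2);
        best = Math.min(best, Integer.parseInt(parts[0]));
        if (parts.length == 2) arcs = parts[1]; // the structure record
      }
      // Output in the input format, ready for the next iteration.
      out.collect(node, new Text(best + "|" + arcs));
    }
  }
}

Each run of this job advances the frontier by one hop; the driver would re-run it (for example with KeyValueTextInputFormat over the previous output) until no distance changes.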
With Weights
- Most interesting graphs have arcs with weights (different from 1).
- The algorithm is still valid, but the point list must include the weight of each arc.
- Mappers output (p, D + weight_p) instead of (p, D + 1) for each neighbor p, as in the variant below.
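Only the mapper's propagation loop in the sketch above changes; a variant assuming the weighted "(p, w)" adjacency format from the sparse-representation slide:

// Weighted variant of the mapper's inner loop: each adjacency entry now
// carries its arc weight, and the candidate distance is d + weight.
for (String pair : parts[1].split("\\)\\s*,?\\s*")) {
  String[] nw = pair.replace("(", "").split(",");
  int weight = Integer.parseInt(nw[1].trim());
  out.collect(new Text(nw[0].trim()),
              new Text(Integer.toString(d + weight)));
}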
Conclusions
- MapReduce makes us think differently.
- But it makes quite sophisticated parallel computations possible.
- It is fundamental to think about the amount of information transmitted.