MapReduce Details: Optimizing the Reduce Phase with the Combiner

Charity Spencer
8 years ago

MapReduce Details
Optimizing the Reduce phase with the Combiner

- If a Combiner is present, the framework inserts it into the processing pipeline on the nodes that have just finished the Map phase.
- The Combiner runs after the Map phase, but before the intermediate data is sent to other nodes.
- The Combiner receives the data produced by the Map phase on one node. It sees only the local data, not the data from other nodes.
- It produces key-value pairs that are then sent to the Reducers.
- The Combiner can be used when the Reduce computation can already start without having all the data.
- Example: computing the maximum temperature is a very good fit. The Combiner computes the maximum temperature over the data available on the local node. Instead of sending the pairs (1949, 111) and (1949, 78) to the Reducers, the node sends only the pair (1949, 111).

Distributed file system HDFS

HDFS design decisions:
- Files stored as chunks
    - Fixed size (64 MB)
- Reliability through replication
    - Each chunk replicated across 3+ nodes
- Single master to coordinate access, keep metadata
    - Simple centralized management
- No data caching
    - Little benefit due to large datasets, streaming reads
- Simplify the API
    - Push some of the issues onto the client (e.g., data layout)

[Figure: HDFS architecture. Application with HDFS client; HDFS namenode holding the file namespace (/foo/bar -> block 3d2f); HDFS datanodes, each on top of a Linux file system.]
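Returning to the Combiner slide above, the maximum-temperature case can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop code: the function name `combine_max` and the sample pairs for the two nodes are assumptions made up for the example. It works because taking a maximum gives the same result whether it is applied once to all values or first per node and then again to the partial results.

```python
def combine_max(pairs):
    """Keep only the largest value seen for each key.

    The same function can serve as Combiner (on one node's map output)
    and as Reducer (on the merged Combiner outputs).
    """
    best = {}
    for key, value in pairs:
        if key not in best or value > best[key]:
            best[key] = value
    return sorted(best.items())


# Hypothetical map output on two nodes (made-up sample data)
node_a = [("1949", 111), ("1949", 78), ("1950", 22)]
node_b = [("1949", 95), ("1950", 0), ("1950", -11)]

# Each node combines its own local data only
combined_a = combine_max(node_a)
combined_b = combine_max(node_b)

# The Reducer sees the (smaller) Combiner outputs ...
with_combiner = combine_max(combined_a + combined_b)
# ... and gets the same answer as if it had seen every pair
without_combiner = combine_max(node_a + node_b)

print(combined_a)
print(with_combiner == without_combiner)
```

In this sketch node A sends two pairs instead of three, and the final result is unchanged. An operation such as a plain average would not have this property, so it could not be used as a Combiner in this direct way.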

Namenode responsibilities

- Managing the file system namespace:
    - Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
- Coordinating file operations:
    - Directs clients to datanodes for reads and writes
    - No data is moved through the namenode
- Maintaining overall health:
    - Periodic communication with the datanodes
    - Block re-replication and rebalancing
    - Garbage collection

Putting everything together

- Per cluster:
    - One Namenode (NN): master node for HDFS
    - One Jobtracker (JT): master node for job submission
- Per slave machine:
    - One Tasktracker (TT): contains multiple task slots
    - One Datanode (DN): serves HDFS data blocks

[Figure: cluster layout across the MapReduce and HDFS layers. Master node running the jobtracker and the namenode, each with a web UI; slave nodes each running a tasktracker and a datanode.]
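The division of labour described above (the namenode holds metadata and directs clients, the datanodes hold and serve the data) can be sketched with two toy classes. Everything here is hypothetical and for illustration only: `ToyNamenode`, `ToyDatanode`, `read_file`, the node names and the second block id are invented, and none of this is the real HDFS API.

```python
class ToyNamenode:
    """Holds metadata only: the namespace and the block locations."""

    def __init__(self):
        self.file_to_blocks = {}   # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of datanode names

    def add_file(self, path, blocks):
        # blocks is a list of (block_id, [datanode names]) in file order
        self.file_to_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = list(nodes)

    def locate(self, path):
        """Tell a client where each block of a file can be read."""
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[path]]


class ToyDatanode:
    """Stores block contents and serves them directly to clients."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client-side read: metadata from the namenode, bytes from datanodes."""
    data = b""
    for block_id, nodes in namenode.locate(path):
        data += datanodes[nodes[0]].read(block_id)  # use the first replica
    return data


# Three datanodes, every block replicated on all three (made-up setup)
datanodes = {name: ToyDatanode(name) for name in ("dn1", "dn2", "dn3")}
namenode = ToyNamenode()
contents = [("3d2f", b"hello "), ("9a1c", b"world")]
for block_id, data in contents:
    for node in datanodes.values():
        node.store(block_id, data)
namenode.add_file("/foo/bar", [(b, list(datanodes)) for b, _ in contents])

print(read_file(namenode, datanodes, "/foo/bar"))
```

Note that `ToyNamenode` never touches block contents, which mirrors the point that no data is moved through the namenode.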

Important counters

Map phase:
- Map input records: number of input records consumed by all mappers
- Map output records: number of key/value pairs produced by all mappers

Shuffle and sort phase:
- Reduce shuffle bytes: number of bytes of map output copied by the shuffle to reducers (may be compressed)

Reduce phase:
- Reduce input groups: number of unique keys fed into the reducers
- Reduce output records: number of key/value pairs produced by all reducers

Hadoop Streaming
Writing MapReduce in scripting languages (Python, Ruby, ...)

- To write Map and Reduce functions in languages other than Java, there is the Hadoop Streaming API.
- It uses Unix standard streams as the interface between Hadoop and your program.
- Your program reads data from standard input and writes data to standard output.
- All data is in text format:
    - The original input data needs to be a text file
    - The key-value pairs are passed as text as well
- Mapper
    - Receives the file to be processed as lines of text.
    - Writes the output key-value pairs as lines of text. One pair per line, key and value separated by a tab character.
- Reducer
    - Receives the input key-value pairs as lines of text. One line contains one key and one value, separated by a tab character. If a key has multiple values, the key is repeated on several lines.
    - Writes the output key-value pairs as lines of text.
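To make the text format and the counters above more concrete, here is a small local simulation, written as a sketch rather than as anything Hadoop itself runs. The function `run_local`, the simplified "year temperature" input lines and the sample data are assumptions for illustration. The sort step merely stands in for shuffle and sort, and "Reduce shuffle bytes" is not modelled because nothing is copied over a network.

```python
from itertools import groupby


def max_temp_mapper(line):
    """Emit (year, temp) for lines of the hypothetical form 'year temp'."""
    year, temp = line.split()
    if temp == "+9999":  # treated as a missing reading and skipped
        return
    yield year, temp


def max_reducer(key, values):
    """Emit the maximum of all values that share one key."""
    yield key, str(max(int(v) for v in values))


def run_local(lines, mapper, reducer):
    counters = {
        "Map input records": 0,
        "Map output records": 0,
        "Reduce input groups": 0,
        "Reduce output records": 0,
    }
    # Map: one output pair per text line, key and value separated by a tab
    map_output = []
    for line in lines:
        counters["Map input records"] += 1
        for key, value in mapper(line):
            map_output.append("%s\t%s" % (key, value))
            counters["Map output records"] += 1
    # Stand-in for shuffle and sort: order the lines by key
    map_output.sort(key=lambda text: text.split("\t", 1)[0])
    # Reduce: consecutive lines with the same key form one group
    pairs = (text.split("\t", 1) for text in map_output)
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        counters["Reduce input groups"] += 1
        for out_key, out_value in reducer(key, (v for _, v in group)):
            results.append("%s\t%s" % (out_key, out_value))
            counters["Reduce output records"] += 1
    return results, counters


lines = ["1949 111", "1949 78", "1950 +9999", "1950 22", "1950 -11"]
results, counters = run_local(lines, max_temp_mapper, max_reducer)
print(results)
print(counters)
```

With this sample input the mapper reads five records but emits only four pairs, which is exactly the kind of difference the Map input records and Map output records counters reveal on a real job.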

Hadoop Streaming Example
Mapper in Python for maximum temperature

    #!/usr/bin/env python
    #
    # max_temperature_map.py - Calculate maximum temperature from NCDC Global
    # Hourly Data - Mapper part

    import re   # import regular expressions
    import sys  # import system-specific parameters and functions

    # loop through the input, line by line
    for line in sys.stdin:
        # remove leading and trailing whitespace
        val = line.strip()
        # extract values for year, temperature and quality indicator
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        # temperature is valid if not +9999 and quality indicator is
        # one of 0, 1, 4, 5 or 9
        if (temp != "+9999" and re.match("[01459]", q)):
            print "%s\t%s" % (year, temp)

Hadoop Streaming Example
Reducer in Python for maximum temperature

    #!/usr/bin/env python
    #
    # max_temperature_reduce.py - Calculate maximum temperature from NCDC Global
    # Hourly Data - Reducer part

    import sys

    (last_key, max_val) = (None, -sys.maxint)

    # loop through the input, line by line
    for line in sys.stdin:
        # each line contains a key and a value separated by a tab character
        (key, val) = line.strip().split("\t")
        # Hadoop has sorted the input by key, so we get the values
        # for the same key immediately one after the other.
        # Test if we just got a new key, in this case output the maximum
        # temperature for the previous key and reinitialize the variables.
        # If not, keep calculating the maximum temperature.
        if last_key and last_key != key:
            print "%s\t%s" % (last_key, max_val)
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))

    # we've reached the end of the file, output what is left
    if last_key:
        print "%s\t%s" % (last_key, max_val)
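The two scripts above use Python 2 syntax (the print statement and sys.maxint). As a hedged sketch, the same logic can be written as Python 3 functions and tried locally without Hadoop. The names `map_line`, `reduce_lines` and `fake_record` are assumptions for this example, and the records are synthetic lines that only place the three fields at the offsets the mapper reads; they are not real NCDC data.

```python
import re

MISSING = "+9999"  # value the mapper above treats as "no temperature"


def map_line(line):
    """Return (year, temp) for a valid record, otherwise None."""
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    if temp != MISSING and re.match("[01459]", q):
        return year, temp
    return None


def reduce_lines(lines):
    """Yield (key, maximum) from key-sorted 'key<TAB>value' lines."""
    last_key, max_val = None, None
    for line in lines:
        key, val = line.strip().split("\t")
        if last_key is not None and last_key != key:
            # a new key starts: emit the result for the previous one
            yield last_key, max_val
            max_val = None
        value = int(val)
        max_val = value if max_val is None else max(max_val, value)
        last_key = key
    if last_key is not None:
        yield last_key, max_val


def fake_record(year, temp, quality):
    """Build a synthetic fixed-width line (hypothetical, not real NCDC data)."""
    return "0" * 15 + year + "0" * 68 + temp + quality


records = [
    fake_record("1949", "+0111", "1"),
    fake_record("1949", "+0078", "1"),
    fake_record("1950", "+9999", "1"),  # missing temperature, dropped
    fake_record("1950", "-0011", "2"),  # quality code 2, dropped
    fake_record("1950", "+0022", "5"),
]

mapped = [pair for pair in map(map_line, records) if pair]
sorted_lines = sorted("%s\t%s" % pair for pair in mapped)  # stands in for Hadoop's sort
result = list(reduce_lines(sorted_lines))
print(result)
```

Tracking the running maximum with None instead of a large negative sentinel avoids depending on sys.maxint, which does not exist in Python 3.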

Python Essential concepts

- An executable Python script starts with the line
      #!/usr/bin/env python
- No ; at the end of statements
- Variables do not need to be declared before they are used:
      temperature = 21.4
- Variables have types (int, float, bool, string, ...), but the type of a variable does not need to be declared; it is derived automatically from the value:
      my_int = 12
      my_float = 21.4
      my_string = "Hello"
- A variable can be left without a value by assigning the built-in constant None:
      my_string = None
- Control structures (if, while, for, ...) use indentation instead of braces { } or keywords (do ... done) to group statements:
      if temperature > 27.5:
          print "It is getting too hot."
          print "Get a drink."
      elif temperature < 2.5:
          print "It is getting too cold."
      else:
          print "Temperature OK."

      words = [ 'how', 'are', 'you' ]
      for w in words:
          print w, len(w)
- Assignments can be done in tuples:
      (d, e, f) = (a, b/2, c+3)

Python Essential concepts

String operations
- Read standard input line by line:
      for line in sys.stdin:
          print line
- Remove whitespace at beginning and end of string:
      stripped = line.strip()
- Split a string into fields based on a delimiter character:
      (key, val) = line.split("\t")
- Extract substring:
      substr = line[12:18]
- Match a regular expression (re.match only matches at the beginning of the string):
      if re.match("abc", text):
          print "Found abc at the start of text"

Formatted output
- Similar to C, format string followed by values:
      print "%s\t%s" % (string1, string2)
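The string operations listed above can be tried together on one sample line. The snippet below is a small sketch with a made-up input line; it is written so that it runs under Python 3. It also shows one detail worth knowing: re.match only succeeds when the pattern matches at the beginning of the string, whereas re.search looks for the pattern anywhere.

```python
import re

line = "  1949\t+0111\n"            # made-up line with surrounding whitespace

stripped = line.strip()             # "1949\t+0111"
key, val = stripped.split("\t")     # "1949" and "+0111"
century = key[0:2]                  # "19"
formatted = "%s\t%d" % (key, int(val))

# re.match anchors at the start of the string, re.search does not
starts_with_abc = re.match("abc", "xxabc") is not None    # False
contains_abc = re.search("abc", "xxabc") is not None      # True

print(repr(formatted))
print(starts_with_abc, contains_abc)
```

For the mapper's quality check a single-character field is tested, so re.match is sufficient there.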

Python Essential concepts

- Python tutorial:
- Python documentation:
