Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış
Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details Reduce Side Details Future Concepts Demo Q&A
Map Reduce Concepts Basic Idea In the schema: input data is split into partitions and processed in parallel by map tasks. The output of the map tasks is then collected by the reduce tasks, where the final computation is performed and the output is written.
Map Reduce Concepts JobClient: Client agent that resides in hadoop-client.jar; it opens communication with the JobTracker and submits the job. JobTracker: The service within Hadoop that assigns MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack. It also keeps track of tasks and input data. TaskTracker: A node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
Map Reduce Concepts http://aws.typepad.com/files/mapreduce.gif
Map Reduce Internal Mapper Class

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
Map Reduce Internal Reducer Class

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
Map Reduce Internal Main Class

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
Job Execution Workflow Image source: Hadoop: The Definitive Guide, p. 168
Job Execution Workflow 1. JobClient.runJob(conf); the JobClient submits the job and the MapReduce program starts. 2. The JobClient asks the JobTracker for a new job ID. 3. The JobClient checks whether the output directory already exists (new Path(args[1]) in our case). 4. The JobClient computes the input splits. (The input directory check also happens here, new Path(args[0]).)
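Step 3 is a fail-fast guard: the job is rejected before any task runs, so one job can never silently overwrite the results of an earlier run. A minimal sketch of that behaviour, using java.io.File in place of Hadoop's FileSystem API so it runs standalone (OutputCheck and checkOutputDir are illustrative names, not Hadoop API):

```java
import java.io.File;

public class OutputCheck {
    // Illustrative stand-in for the pre-submission output-directory check.
    // Real Hadoop runs the equivalent test against HDFS before the job starts.
    static void checkOutputDir(File dir) {
        if (dir.exists()) {
            // Rejecting the job here means no task ever writes over old results.
            throw new IllegalStateException("Output directory already exists: " + dir);
        }
    }

    public static void main(String[] args) {
        // Passes silently as long as this path does not exist yet.
        checkOutputDir(new File("wordcount_output_dir"));
        System.out.println("ok");
    }
}
```

Running the same job twice with the same args[1] therefore fails immediately on the second run unless the output directory is deleted first.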
Job Execution Workflow 5. Copies the resources needed to run the job, including the job JAR file, the configuration file and the computed input splits, to the JobTracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (mapred.submit.replication) across the cluster. (Good question: to which worker machines?) 6. Tells the JobTracker the job is ready to run. 7. The JobTracker puts the job into an internal queue, where the JobScheduler picks it up and initializes it.
Job Execution Workflow 8. The JobScheduler retrieves the input splits from the shared file system (HDFS). 9. The JobScheduler creates one map task for each split (here a split usually corresponds to one HDFS block). 10. The JobScheduler reads the number of reduce tasks from the mapred.reduce.tasks property and creates that many reduce tasks (a single reduce task by default, if the property is not set). 11. The JobScheduler gives IDs to all tasks at this point.
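Step 9 can be illustrated with a little arithmetic. A sketch, assuming the default case where the split size equals the HDFS block size (SplitCount and numSplits are illustrative names, not Hadoop API):

```java
// Sketch: how many map tasks a job gets when each HDFS block of the
// input file becomes one split, and therefore one map task.
public class SplitCount {
    // Ceiling division: a final partial block still needs its own map task.
    static long numSplits(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, the classic HDFS default
        // A 1 GB input file yields 16 map tasks.
        System.out.println(numSplits(1024L * 1024 * 1024, blockSize)); // 16
    }
}
```

Note the contrast with step 10: the number of map tasks follows from the input size, while the number of reduce tasks is set explicitly by the user.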
Job Execution Workflow 12. Task assignment starts. How does the JobTracker know which TaskTrackers are ready to run tasks? TaskTrackers send a heartbeat to the JobTracker periodically; as part of the heartbeat, a TaskTracker indicates whether it is ready to run a new task. 13. Map tasks are assigned with priority over reduce tasks. 14. The JobTracker assigns map tasks to the TaskTrackers closest to the related data. There are three possibilities: data-local, rack-local and remote. Built-in counters record these statistics.
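The three locality levels in step 14 amount to a simple preference order: same node beats same rack, which beats anything else. A sketch of that classification, assuming the scheduler knows the host and rack of both the TaskTracker and the split's data (all names here are illustrative, not Hadoop API):

```java
// Sketch: the locality preference applied when matching a map task
// (whose input split lives on splitHost/splitRack) to a TaskTracker.
public class Locality {
    enum Level { DATA_LOCAL, RACK_LOCAL, REMOTE }

    static Level classify(String trackerHost, String trackerRack,
                          String splitHost, String splitRack) {
        if (trackerHost.equals(splitHost)) return Level.DATA_LOCAL; // same node
        if (trackerRack.equals(splitRack)) return Level.RACK_LOCAL; // same rack
        return Level.REMOTE;                                        // cross-rack copy
    }

    public static void main(String[] args) {
        System.out.println(classify("node3", "rack1", "node3", "rack1")); // DATA_LOCAL
        System.out.println(classify("node4", "rack1", "node3", "rack1")); // RACK_LOCAL
    }
}
```

The point of the ordering is network cost: a data-local task reads from local disk, a rack-local task crosses only the top-of-rack switch, and a remote task pays for an inter-rack transfer.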
Job Execution Workflow 15. Task execution starts on each TaskTracker. 16. The TaskTracker copies the job JAR to its local file system. 17. The TaskTracker creates a local working folder for the task. 18. The TaskTracker creates an instance of TaskRunner. 19. The TaskRunner launches a new JVM for each task.
Job Execution Workflow 20. TaskTrackers update the JobTracker about the progress of their tasks. 21. The job succeeds when all tasks on every TaskTracker finish successfully. 22. The JobTracker notifies the JobClient about the job status (e.g. via HTTP).
Map Side Details Each map task has a memory buffer that it writes its output to. The buffer is 100 MB by default (io.sort.mb). When the buffer reaches the spill threshold (io.sort.spill.percent, default 80%), a background thread starts spilling the contents to disk, into the directory specified by mapred.local.dir. Each time the buffer reaches the spill threshold, a new spill file is created.
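The two settings above combine into a single trigger point. A sketch of the arithmetic, using integer math to stay exact (SpillThreshold and spillTriggerBytes are illustrative names, not Hadoop API):

```java
// Sketch: the number of buffered bytes at which a map task's background
// thread begins writing a spill file to mapred.local.dir.
public class SpillThreshold {
    static long spillTriggerBytes(long bufferBytes, int spillPercent) {
        return bufferBytes * spillPercent / 100;
    }

    public static void main(String[] args) {
        long buffer = 100L * 1024 * 1024; // io.sort.mb default: 100 MB
        // io.sort.spill.percent default: 80% -> spill starts at 80 MB.
        System.out.println(spillTriggerBytes(buffer, 80)); // 83886080
    }
}
```

Spilling at 80% rather than 100% lets the map task keep writing into the remaining buffer space while the spill is in progress.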
Reduce Side Details The map output files sit on the local disk of each TaskTracker. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. As map tasks complete, they notify their TaskTracker with a status update, and the TaskTracker notifies the JobTracker via the heartbeat. The JobTracker therefore knows the mapping between map outputs and TaskTrackers. In the reduce phase, the reduce function runs on each key and (generally) saves its output to HDFS.
Future Concepts Resource Management (YARN) Security
Demo