Hadoop MapReduce: Review Spring 2015, X. Zhang Fordham Univ.
Outline
1. Review of how MapReduce works: HDFS, YARN; sorting and shuffling; advanced topics: partial sort, total sort, join, chained mapper/reducer, multiple inputs
2. Review of the MapReduce programming paradigm: job default settings; checklist of things to set
3. MapReduce streaming API
4. Command-line options, configurations
Challenges
The general problem of the big-data era: how to process a very large volume of data in a reasonable amount of time? It turns out disk bandwidth has become the bottleneck, i.e., a hard disk cannot read data fast enough. Solution: parallel processing.
Google's problem: crawl, analyze, and rank web pages into a giant inverted index (to support its search engine). Google engineers went ahead and built their own systems:
Google File System (GFS): exabyte-scale data management using commodity hardware
Google MapReduce (GMR): an implementation of a design pattern applied to massively parallel processing
Hadoop History
Originally grew out of the Nutch project (web crawling and indexing, developed further at Yahoo!): crawl and index a large number of web pages. Idea: the program is distributed and processes the part of the data stored with it.
Two Google papers => the Hadoop project, an open-source implementation of a distributed file system and the MapReduce framework.
Hadoop: a scheduling and resource-management framework for executing map and reduce jobs in a cluster environment. Now an open-source project, Apache Hadoop.
Hadoop ecosystem: various tools that make it easier to use, e.g., Hive and Pig, which translate more abstract descriptions of a workload into map-reduce pipelines.
MapReduce
End-user MapReduce API: for programming MapReduce applications.
MapReduce framework: the runtime implementation of the various phases, such as the map phase, sort/shuffle/merge aggregation, and the reduce phase.
MapReduce system: the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, etc.
Hadoop: HDFS, MapReduce (architecture diagram)
Hadoop Daemons
Hadoop (HDFS and MapReduce) is a distributed system:
a distributed file system
support for running MapReduce programs in a distributed and parallel fashion, with automatic input splitting and shuffling
fault tolerance, load balancing, ...
To support these, Hadoop runs several daemons (processes running in the background). HDFS: namenode, datanode; MapReduce: jobtracker (1.x), resource manager and node manager (YARN).
These daemons communicate with each other via RPC (Remote Procedure Call); the cluster start scripts use SSH to launch them on the nodes. They usually allow users to view their status via a Web interface. Both kinds of inter-process communication go over sockets (the network API). Will learn more about this later.
HDFS: NameNode & DataNode
namenode: stores filesystem metadata, i.e., which file maps to which block locations and which blocks are stored on which datanode. A secondary namenode regularly connects to the primary namenode and snapshots the filesystem metadata into local/remote storage.
datanode: where the actual data resides.
A datanode stores each file block along with a checksum for it. It updates the namenode with block information periodically, verifying checksums before reporting. If the checksum is incorrect for a particular block (i.e., there is disk-level corruption for that block), the datanode skips that block when reporting block information to the namenode => the namenode replicates that block somewhere else.
Datanodes send heartbeat messages to the namenode to say they are alive => the namenode detects datanode failure and initiates replication of its blocks.
Datanodes can talk to each other to rebalance data, move and copy data around, and keep replication high.
Hadoop Daemons: default Web UI ports (HDFS)
namenode: port 50070, configuration parameter dfs.http.address
datanode: port 50075, configuration parameter dfs.datanode.http.address
secondary namenode: port 50090, configuration parameter dfs.secondary.http.address
You can open a browser to http://<ip_address_of_namenode>:50070/ to view various information about the namenode.
Plan: install a text-based Web browser on puppet, so that we can use the web-based user interface.
Hadoop 1.x
There are two types of nodes that control the job-execution process: a jobtracker and a number of tasktrackers.
jobtracker: coordinates all jobs run on the system by scheduling tasks to run on tasktrackers.
tasktrackers: run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
YARN: Yet Another Resource Negotiator
Resource management => a global ResourceManager
Per-node resource monitoring => NodeManager
Job scheduling/monitoring => per-application ApplicationMaster (AM)
YARN: Master-Slave
System: the ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
ResourceManager: the ultimate authority that arbitrates resources among all applications in the system. Its pluggable Scheduler allocates resources to the running applications based on their resource requirements, using the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk, network, etc.
Per-application ApplicationMaster: negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.
Web UI for YARN Daemons
ResourceManager: port 8088, configuration name yarn.resourcemanager.webapp.address
NodeManager: port 50060, configuration name yarn.nodemanager.webapp.address
URL to view the status of the ResourceManager: http://<ip address of RM>:8088
Outline
1. Review of how MapReduce works: HDFS, YARN
2. Review of the MapReduce programming paradigm: job default settings; checklist of things to set: sorting and shuffling; advanced topics: partial sort, total sort, join, chained mapper/reducer, multiple inputs
3. MapReduce streaming API
4. Command-line options, configurations
MapReduce
End-user MapReduce API: for programming MapReduce applications.
MapReduce framework: the runtime implementation of the various phases, such as the map phase, sort/shuffle/merge aggregation, and the reduce phase.
MapReduce system: the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, etc.
MapReduce Programming Model (data-flow diagram): Input, a set of [key,value] pairs => split => map => intermediate [key,value] pairs, grouped by key ([k1, (v11, v12, ...)], [k2, (v21, v22, ...)]) => shuffle => reduce => Output, a set of [key,value] pairs
Parallel Execution: Scaling Out
A MapReduce job is a unit of work that the client/user wants performed. It consists of: the input data, the MapReduce program, and configuration information.
The Hadoop system:
divides the job into map and reduce tasks
divides the input into fixed-size pieces called input splits, or just splits
creates one map task for each split, which runs the user-defined map function for each record in the split
MapReduce and HDFS
The parallelism of MapReduce + the very high aggregate I/O bandwidth across a large cluster provided by HDFS => the economics of the system are extremely compelling, a key factor in the popularity of Hadoop.
Key: lack of data motion, i.e., move compute to the data; do not move data to the compute node over the network. Specifically, MapReduce tasks can be scheduled on the same physical nodes on which the data resides in HDFS, which exposes the underlying storage layout across the cluster.
Benefit: reduces network I/O and keeps most of the I/O on the local disk or within the same rack.
Hadoop 1.x
There are two types of nodes that control the job-execution process: a jobtracker and a number of tasktrackers.
jobtracker: coordinates all jobs run on the system by scheduling tasks to run on tasktrackers.
tasktrackers: run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
YARN: Yet Another Resource Negotiator
Resource management => a global ResourceManager
Per-node resource monitoring => NodeManager
Job scheduling/monitoring => per-application ApplicationMaster (AM)
Hadoop daemons are Java processes, running in the background, talking to each other via RPC; the cluster start scripts use SSH to launch them on the nodes.
YARN: Master-Slave
System: the ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
ResourceManager: the ultimate authority that arbitrates resources among all applications in the system. Its pluggable Scheduler allocates resources to the running applications based on their resource requirements, using the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk, network, etc.
Per-application ApplicationMaster: negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.
Outline
1. Review of how MapReduce works: HDFS, YARN; sorting and shuffling; advanced topics: partial sort, total sort, join, chained mapper/reducer, multiple inputs
2. Review of the MapReduce programming paradigm: job default settings; checklist of things to set
3. MapReduce streaming API
4. Command-line options, configurations
MapReduce Programming (data-flow diagram): Input (file: text or binary, or database), a set of [key,value] pairs => InputFormat => split => map => partition (hash key to reduce task) => shuffle => intermediate [key,value] pairs grouped as [k, (v1, v2, ...)] => reduce => OutputFormat => Output, a set of [key,value] pairs
Hadoop Streaming API
A generic API for the MapReduce framework: mappers/reducers can be written in any language, or be simple Unix commands.
Mappers/reducers act as filters: they receive input on stdin and write output to stdout.
For text processing: each <key,value> pair occupies one line, with key and value separated by a tab character. The mapper/reducer reads each line (a <key,value> pair) from stdin, processes it, and writes a line (a <key,value> pair) to stdout. Splitting and shuffling are handled by the framework as usual.
Usage of InputFormat
An InputFormat is responsible for creating input splits and dividing them into records:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
}
The default input format is TextInputFormat, which produces one (key, value) pair for each line of a text file. We will later look at a customized InputFormat class.
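As a quick illustration (not from the original slides), a minimal sketch of how a driver selects the InputFormat; TextInputFormat is the default, and KeyValueTextInputFormat is shown only as an assumed alternative for tab-separated input:
  Job job = Job.getInstance();
  // default: key = byte offset in the file (LongWritable), value = line contents (Text)
  job.setInputFormatClass(TextInputFormat.class);
  // alternative if each input line is already "key<TAB>value"
  // job.setInputFormatClass(KeyValueTextInputFormat.class);
Both classes live in org.apache.hadoop.mapreduce.lib.input.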
InputSplit
An input split is a chunk of the input that is processed by a single map task. Each split is divided into records, and the map task processes each record, a key-value pair, in turn.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
An InputSplit has a length in bytes and a set of storage locations (hostname strings).
Starting map tasks
1. The client running the job calculates the splits for the job by calling getSplits().
2. The client sends the splits to the jobtracker/resource manager, which uses their storage locations to schedule map tasks to process them.
3. Each map task passes its split to the createRecordReader() method on the InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records; the map task uses the RecordReader to generate key-value pairs, which it passes to the map function.
Default implementation of the Mapper class's run() function:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}
Split and Record
Sometimes a record spans two blocks/input splits. In that case map task #1, located on the same node as the first block/split of the file, needs to perform a remote read to obtain record 5 (which spans two blocks).
InputFormat Hierarchy (class-hierarchy diagram)
FileInputFormat
The parent class for all input formats that read from files.
The input to a job is a collection of paths, specified via static methods on FileInputFormat:
  void addInputPath(Job job, Path path)
  void addInputPaths(Job job, String commaSeparatedPaths)
  void setInputPaths(Job job, Path... inputPaths)
  void setInputPaths(Job job, String commaSeparatedPaths)
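A minimal usage sketch (the paths here are hypothetical, not the course's actual data layout):
  FileInputFormat.addInputPath(job, new Path("citibike/2014-04.csv"));
  FileInputFormat.addInputPaths(job, "citibike/2014-05.csv,citibike/2014-06.csv");
  // or replace whatever was added before with exactly these paths:
  FileInputFormat.setInputPaths(job, new Path("citibike"));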
Split Size
FileInputFormat splits only large files; here, "large" means larger than an HDFS block.
By default, minimumSize < blockSize < maximumSize.
Formula for the split size: max(minimumSize, min(maximumSize, blockSize))
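For example, a sketch of shrinking the maximum split size in the driver (the values are illustrative), which has the same effect as passing -Dmapred.max.split.size on the command line as in the demo below:
  FileInputFormat.setMinInputSplitSize(job, 1);       // bytes
  FileInputFormat.setMaxInputSplitSize(job, 20000);   // bytes; forces many splits even within one block
With maximumSize = 20000 < blockSize, the formula max(minimumSize, min(maximumSize, blockSize)) yields a 20000-byte split size.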
Demo: How to change the maximum split size so that a file smaller than a block is split?
In ~/hadoop_samplecodes/citibike/shellscript/RunDefaultJob.sh, pass -Dmapred.max.split.size=20000.
Try ./RunDefaultJob.sh >& dd, then grep for 2014-04.csv in dd and count lines => 655 map tasks:
2015-03-30 15:38:40,425 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(245)) - Total input paths to process : 1
2015-03-30 15:38:40,496 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(371)) - number of splits:655
2015-03-30 15:25:49,678 INFO [Thread-2] mapred.Merger (Merger.java:merge(568)) - Merging 655 sorted segments
Demo: How to configure the number of reducers?
Note: this does not work in local mode; see the LocalJobRunner.java code, which runs the map tasks sequentially and then starts one reduce task.
On the command line: -Dmapred.reduce.tasks=2
In code: job.setNumReduceTasks(2);
Scripts: ~/hadoop_samplecodes/citibike/shellscript/RunDefaultJob.sh -Dmapred.max.split.size=20000, RunDefaultJob_pseudo.sh
Shuffling (data-flow diagram): Input (file: text or binary, or database), a set of [key,value] pairs => InputFormat => split => map => partition (hash key to reduce task) => shuffle => intermediate [key,value] pairs grouped as [k, (v1, v2, ...)] => reduce => OutputFormat => Output, a set of [key,value] pairs
Which reduce task?
For each intermediate key-value pair (K2, V2), which reduce task does it go to?
Partition the whole domain of K2 into multiple partitions; each reduce task processes one partition.
The partition function operates on the intermediate key and value types (K2 and V2) and returns a partition index. In practice, the partition is determined solely by the key (the value is ignored).
The default partitioner is HashPartitioner:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
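To route keys differently, you can plug in your own Partitioner. A hypothetical sketch (the class name and routing rule are made up for illustration) that partitions Text keys by their first character:
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // route all words starting with the same letter to the same reducer
    String s = key.toString().toLowerCase();
    char first = s.isEmpty() ? ' ' : s.charAt(0);
    return first % numReduceTasks;
  }
}
// in the driver:
job.setPartitionerClass(FirstLetterPartitioner.class);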
ChainMapper class
The ChainMapper class lets you use multiple Mapper classes within a single map task:
mapper1 => mapper2 => mapper3 => lastMapper
The output of the first becomes the input of the second, and so on until the last Mapper; the output of the last Mapper is written to the task's output.
Benefits:
modularity (simple, reusable, specialized Mappers)
composability (mappers can be combined to perform composite operations)
reduced disk I/O compared to multiple chained MapReduce jobs
a word count job
Job job = Job.getInstance();
Configuration splitMapConfig = new Configuration(false);
ChainMapper.addMapper(job, SplitMapper.class, LongWritable.class, Text.class, Text.class, IntWritable.class, splitMapConfig);
Configuration lowerCaseMapConfig = new Configuration(false);
ChainMapper.addMapper(job, LowerCaseMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, lowerCaseMapConfig);
job.setJarByClass(ChainMapperDriver.class);
job.setCombinerClass(ChainMapReducer.class);
job.setReducerClass(ChainMapReducer.class);
forming a chain of mappers
public static <K1,V1,K2,V2> void addMapper(JobConf job,
    Class<? extends Mapper<K1,V1,K2,V2>> klass,
    Class<? extends K1> inputKeyClass,
    Class<? extends V1> inputValueClass,
    Class<? extends K2> outputKeyClass,
    Class<? extends V2> outputValueClass,
    boolean byValue,
    JobConf mapperConf)
Adds a Mapper class to the chain job's JobConf.
byValue: indicates whether keys/values should be passed by value to the next Mapper in the chain (if any) or by reference. If a Mapper leverages the assumed semantics that the key and values are not modified by the collector, 'by value' must be used. If the Mapper does not expect this semantics, 'by reference' can be used as an optimization to avoid serialization and deserialization.
IMPORTANT: There is no need to specify the output key/value classes for the ChainMapper; this is done by addMapper for the last mapper in the chain.
SequenceFile
SequenceFile provides a persistent data structure for binary key-value pairs.
Keys and values stored in a SequenceFile do not necessarily need to be Writable; any types that can be serialized and deserialized by a Serialization may be used (i.e., converted to/from a byte stream).
In contrast, the default TextOutputFormat writes keys and values by calling their toString() method, converting each object/value to/from a stream of text (e.g., ASCII).
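For instance, a sketch (key/value types chosen arbitrarily here) of writing a job's output as a SequenceFile instead of text:
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  // optional: block-compress the binary output
  FileOutputFormat.setCompressOutput(job, true);
  SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
The matching SequenceFileInputFormat can then read the binary output back in a downstream job.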
MapFile
A file-based map from keys to values. A MapFile is a directory containing two files:
the data file, containing all keys and values in the map
the index file, containing a fraction of the keys (the fraction is determined by MapFile.Writer.getIndexInterval())
The index file is read entirely into memory, so key implementations should try to keep themselves small. This allows quick lookup of a record.
Exercise: run the MapReduce job that uses the MapFileOutputFormat class and examine the output directory.
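A sketch of looking up one record in such a job's output (the output path and the key/value types are assumptions for illustration):
  Configuration conf = new Configuration();
  MapFile.Reader[] readers = MapFileOutputFormat.getReaders(new Path("output"), conf);
  Partitioner<IntWritable, Text> partitioner = new HashPartitioner<IntWritable, Text>();
  IntWritable key = new IntWritable(1901);
  Text value = new Text();
  // picks the right MapFile using the same partitioner the job used, then does an indexed lookup
  Writable entry = MapFileOutputFormat.getEntry(readers, partitioner, key, value);
  System.out.println(entry == null ? "key not found" : key + "\t" + value);
This works because each reducer writes one MapFile, and the partitioner tells us which reducer a given key went to.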
Sorting in MapReduce Framework
All about sorting
Partial sort comes for free (see previous slides): the output from map is partitioned, sorted, and merged (sorted) before reduce, so each partition is sorted, i.e., each reduce task's output is sorted.
To sort globally (total sort):
use one partition, i.e., one reduce task; or
use a customized partitioner class
All about sorting
Total sort: the outputs of the reduce tasks can be concatenated together to get a globally sorted output.
Idea: use the TotalOrderPartitioner class, so that if k1 < k2 then partition(k1) <= partition(k2).
How to make sure the partitioning is balanced, to minimize running time? Use InputSampler to sample the input keys and get an estimated distribution of the keys. A sketch follows below.
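A driver-side sketch (the sampling parameters and the partition-file path are arbitrary; it assumes the map output key type is Text, matching the sampled input keys):
  job.setPartitionerClass(TotalOrderPartitioner.class);
  TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("_partitions"));
  // sample ~10% of records, up to 10000 samples from at most 10 splits
  InputSampler.Sampler<Text, Text> sampler =
      new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
  InputSampler.writePartitionFile(job, sampler);
writePartitionFile() writes the chosen split points to the partition file, which TotalOrderPartitioner reads at run time to assign each key to a key range (and thus to a reduce task).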
Secondary sort
Goal: sort the output by year, and then within each year, sort by temperature:
1901 120
1901 123
1902 89
1902 111
You could do this in your reduce class: intermediate key-value pairs are grouped, reduce(k, <v1, v2, ...>), but v1, v2, ... are not sorted, so the reducer would have to buffer and sort them itself.
Or we can again take advantage of the MapReduce framework (how it already partitions, sorts, and groups data for us).
How?
Goal: sort the output by year, and then within each year, sort by temperature.
Plan:
use year and temperature as the key (a composite key)
partition and group based on year only, so that records of the same year are sent to the same reduce task and grouped together into one list
sort based on the composite key (year and temperature), so that records within the same group are ordered by temperature
Details
setPartitionerClass: use a FirstPartitioner class, which uses only the first part (e.g., year) of the composite key.
Details
setGroupingComparatorClass: defines the comparator that controls which keys are grouped together for a single call to the Reducer.reduce function. Use a GroupComparator that compares only the first part of the key (e.g., year).
Details
setSortComparatorClass: defines the comparator that controls how the keys are sorted before they are passed to the Reducer. We use a KeyComparator class that sorts by the first part of the composite key (e.g., year) and then by the second part (e.g., temperature).
SortComparator example (year, temperature): 1900 34 / 1900 34 / 1900 34 / 1901 35 / 1901 36
A combined sketch of the partitioner and the two comparators follows below.
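This sketch assumes, purely for illustration, that the composite map output key is a Text of the form "year<TAB>temperature" and the map output value is NullWritable; the lecture's actual classes may differ:
public class FirstPartitioner extends Partitioner<Text, NullWritable> {
  @Override
  public int getPartition(Text key, NullWritable value, int numPartitions) {
    String year = key.toString().split("\t")[0];
    return (year.hashCode() & Integer.MAX_VALUE) % numPartitions;  // partition by year only
  }
}

public class KeyComparator extends WritableComparator {
  public KeyComparator() { super(Text.class, true); }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String[] x = a.toString().split("\t"), y = b.toString().split("\t");
    int cmp = x[0].compareTo(y[0]);                                // first by year
    if (cmp != 0) return cmp;
    return Integer.compare(Integer.parseInt(x[1]), Integer.parseInt(y[1]));  // then by temperature
  }
}

public class GroupComparator extends WritableComparator {
  public GroupComparator() { super(Text.class, true); }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    return a.toString().split("\t")[0].compareTo(b.toString().split("\t")[0]);  // group by year only
  }
}

// in the driver:
job.setPartitionerClass(FirstPartitioner.class);
job.setSortComparatorClass(KeyComparator.class);
job.setGroupingComparatorClass(GroupComparator.class);
Parsing strings inside comparators is not efficient; a custom Writable composite key with a raw comparator is the usual production choice, but the string version keeps the sketch short.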
Outline
MapReduce: how to control the number of tasks
the InputFormat class decides the number of splits => the number of map tasks
the number of reduce tasks is configured by the client
ChainMapper: modular design
Input processing: XML file processing, whole file as a record
Binary output & sorting
Join
Join
Combine two datasets together using a key; here, use StationID.
An example of reduce-side join
Multiple inputs, of different formats: e.g., one is the station records, the other the weather data.
Mapper classes: tag each record with a composite key, e.g., station_id-0 for a station record, station_id-1 for a weather record.
Secondary sort: use the first part of the composite key (the station id) to partition and group, so that each station record reaches the reducer ahead of that station's weather records.
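A driver-side sketch of wiring up the two inputs (the paths and mapper/reducer class names are placeholders, not the assignment's actual names):
  MultipleInputs.addInputPath(job, new Path("input/stations"), TextInputFormat.class, StationRecordMapper.class);
  MultipleInputs.addInputPath(job, new Path("input/weather"), TextInputFormat.class, WeatherRecordMapper.class);
  job.setReducerClass(JoinReducer.class);
  FileOutputFormat.setOutputPath(job, new Path("output"));
Each mapper emits the tagged composite key described above; the join itself happens in the reducer, which sees the station record followed by its weather records.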
Commonly Used Mapper/Reducer classes (table)
Debugging MapReduce: job and task logs
User-level logs
MapReduce Programming (data-flow diagram): Input (file: text or binary, or database), a set of [key,value] pairs => InputFormat => split => map => partition (hash key to reduce task) => shuffle => intermediate [key,value] pairs grouped as [k, (v1, v2, ...)] => reduce => OutputFormat => Output, a set of [key,value] pairs
What happens?
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records; a given input pair may map to zero or many output pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit; finally cleanup(Context) is called.
All intermediate values associated with a given output key are subsequently grouped by the framework and passed to a Reducer to determine the final output. Users can control sorting and grouping by specifying two key RawComparator classes.
Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer.
MapReduce Programming
map: input key/value types (e.g., IntWritable, LongWritable, Text, ...); output key/value types (e.g., IntWritable, LongWritable, Text, ...)
reduce: input key/value types (same as the mapper's output key/value types); output key/value types (e.g., IntWritable, LongWritable, Text, ...)
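A minimal illustration of this type constraint, using the familiar word count (the class names are generic placeholders): the mapper's output types (Text, IntWritable) must match the reducer's input types.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);                    // emits (Text, IntWritable)
    }
  }
}

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();   // consumes (Text, IntWritable)
    context.write(key, new IntWritable(sum));
  }
}
In the driver this is mirrored by job.setMapOutputKeyClass(Text.class), job.setMapOutputValueClass(IntWritable.class), job.setOutputKeyClass(Text.class), and job.setOutputValueClass(IntWritable.class).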
a MapReduce Job
Class Job: org.apache.hadoop.mapreduce.Job
All implemented interfaces: JobContext, org.apache.hadoop.mapreduce.MRJobConfig
public class Job
    extends org.apache.hadoop.mapreduce.task.JobContextImpl
    implements JobContext
The job submitter's view of the Job. It allows the user to:
configure the job, using the set***() methods, which only work until the job is submitted (afterwards they throw an IllegalStateException)
submit it, control its execution, and query its state
Normally the user creates the application, describes various facets of the job via Job, and then submits the job and monitors its progress.
a MapReduce Job
Here is an example of how to submit a job:
// Create a new Job
Job job = Job.getInstance();
job.setJarByClass(MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);
The Mapper class's KEYIN must be consistent with the job's InputFormat class (inputformat.class); the Mapper class's KEYOUT must be consistent with the map output key class (map.out.key.class, set via job.setMapOutputKeyClass()).
a MinimalMapReduce Job
Try to run the MinimalMapReduce job and compare it with the WithDefaults job.
Job: default settings
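A sketch of the new-API defaults, written out as explicit settings (this follows the usual "minimal job with defaults" pattern and is an approximation, not the slide's exact listing):
  job.setInputFormatClass(TextInputFormat.class);
  job.setMapperClass(Mapper.class);                  // identity mapper
  job.setMapOutputKeyClass(LongWritable.class);
  job.setMapOutputValueClass(Text.class);
  job.setPartitionerClass(HashPartitioner.class);
  job.setNumReduceTasks(1);
  job.setReducerClass(Reducer.class);                // identity reducer
  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(Text.class);
  job.setOutputFormatClass(TextOutputFormat.class);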
Default Streaming Job: a more stripped-down streaming job, and an equivalent form with the defaults spelled out.
Usage of streaming separators
Configuration class
Hadoop loads the hdfs-default.xml file from the classpath resources, which the jars supply; this provides the "default" values for the various configs. After this, it loads hdfs-site.xml from the classpath (which mostly resides in the /etc/hadoop/conf/ directory) and applies the overrides onto the default config object.
[zhang@puppet ~]$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*
The default XML files reside inside the hadoop-common and hadoop-hdfs jars. You do not need these files explicitly on the classpath, since they are read from the jars themselves, and they should never be modified.
Reference online:
core-default.xml: http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-common/core-default.xml
hdfs-default.xml: http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
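A small sketch of how this layering looks from code (the property names and fallback defaults are just examples):
  Configuration conf = new Configuration();      // loads core-default.xml, then core-site.xml overrides
  conf.addResource("hdfs-site.xml");             // pull in HDFS overrides from /etc/hadoop/conf on the classpath
  String nnHttp = conf.get("dfs.http.address", "0.0.0.0:50070");
  int replication = conf.getInt("dfs.replication", 3);
  System.out.println("namenode web UI: " + nnHttp + ", replication: " + replication);
Site-file values win over the built-in defaults, and -D command-line options (applied via GenericOptionsParser/ToolRunner) win over both.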