How To Write A MapReduce Program in Java (Hadoop)
1 MapReduce framework
- Operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
- The key and value classes have to be serializable by the framework and hence need to implement the Writable interface.
- Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
- Input and output types of a MapReduce job:
  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
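To make the Writable/WritableComparable requirement concrete, here is a minimal sketch of a custom key class. It is not part of the original slides; the class name and field (a year used as a key) are purely illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type: a year used as a MapReduce key.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public YearKey() {}                      // required no-argument constructor
    public YearKey(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                  // serialize the field
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                 // deserialize in the same order
    }

    @Override
    public int compareTo(YearKey other) {    // ordering used by the sort phase
        return Integer.compare(year, other.year);
    }

    @Override
    public int hashCode() { return year; }   // used by the default HashPartitioner

    @Override
    public boolean equals(Object o) {
        return o instanceof YearKey && ((YearKey) o).year == year;
    }
}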
2 Example: WordCount
Input:
- Text file
Output:
- Single file containing (Word <Tab> Count) lines
Map phase:
- Generates word count pairs:
  { (and,1), (boy,1), (child,1), (and,1), (big,1), (dog,1), (and,1), (rat,1), (tog,1), (paint,1), (an,1), (a,1) }
Reduce phase:
- For each word, calculates the aggregate count:
  { (and,3), (boy,1), (child,1), (big,1), (dog,1), (rat,1), (tog,1), (paint,1), (an,1), (a,1) }
3 Example: WordCount
- Counts the number of occurrences of each word in a given input set.

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);    // emit (word, 1) for every token
      }
    }
  }
4
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();   // add up the 1s emitted for this word
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
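The listing above uses the older org.apache.hadoop.mapred API. For comparison, a minimal sketch of the same program against the newer org.apache.hadoop.mapreduce API (Job, Mapper, Reducer with a Context object) is shown below; it is not part of the original slides, and the class names (WordCountV2, TokenizerMapper, IntSumReducer) are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountV2 {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);           // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();                   // aggregate counts per word
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountV2.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}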
5 Usage
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .

Assuming that:
/usr/joe/wordcount/input - input directory in HDFS
/usr/joe/wordcount/output - output directory in HDFS

Sample text files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
6 Remote Procedure Call (RPC)
Many distributed systems have been based on explicit message exchange between processes. However, the send and receive procedures do not conceal communication at all, which is important for achieving access transparency in distributed systems.
When a process on machine A calls a procedure on machine B, the calling process on A is suspended, and execution of the called procedure takes place on B. Information can be transported from the caller to the callee in the parameters and can come back in the procedure result. No message passing at all is visible to the programmer.
- RPC is the most common framework for newer protocols and for middleware.
- It is used both by operating systems and by applications.
- NFS is implemented as a set of RPCs.
- DCOM, CORBA, Java RMI, etc., are essentially RPC systems.
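Since Java RMI is cited above as an RPC system, here is a minimal sketch of how a remote call looks like a local one; the Greeter interface, its method, the host name, and the registry binding are hypothetical, for illustration only (a server implementation and registry binding would also be needed to run it).

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// Hypothetical remote interface: callers invoke sayHello() as an ordinary method,
// while the implementation actually runs in the server's JVM.
public interface Greeter extends Remote {
    String sayHello(String name) throws RemoteException;
}

// Client side: look up the stub and call it like a local object.
class GreeterClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("server-host", 1099);
        Greeter greeter = (Greeter) registry.lookup("Greeter");   // obtain the stub
        System.out.println(greeter.sayHello("world"));            // remote call, local syntax
    }
}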
7 Remote Procedure Call (RPC)
The RPC Model
The model is similar to the procedure call model used for the transfer of control and data within a program:
1. To make a procedure call, the caller places the arguments to the procedure.
2. Control is then transferred to the instruction sequence of the called procedure.
3. The procedure body is executed in a newly created execution environment.
4. After the procedure's execution is over, control returns to the calling point.
The idea behind RPC is to make a remote procedure call look as much as possible like a local one. In other words, we want RPC to be transparent: the calling procedure should not be aware that the called procedure is executing on a different machine, or vice versa.
8 Remote Procedure Call (RPC)
RPC allows programs to call procedures located on other machines. There are traditional (synchronous) RPCs and asynchronous RPCs.
[Figure: client stub and server stub on top of the RPC runtime - call, pack, send on the client side; receive, unpack, call, return, pack, send on the server side.]
- The client stub packs a specification of the procedure and its arguments into a message and sends it.
- The server stub transforms requests coming in over the network into local procedure calls; it unpacks the parameters from the message.
- When the server stub gets control back after the call has completed, it packs the result into a message and calls send to return it to the client, then waits for the next incoming request.
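To make the pack/send/receive/unpack steps concrete, here is a minimal hand-written client stub sketch over a plain socket. The procedure number, host, port, and add() operation are assumptions for illustration; a real RPC runtime generates this kind of code from an interface description.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

// Sketch of a client stub: the caller sees an ordinary method, while the stub
// marshals the call, sends it, blocks for the reply, and unmarshals the result.
public class AddClientStub {
    private final String host;
    private final int port;

    public AddClientStub(String host, int port) {
        this.host = host;
        this.port = port;
    }

    public int add(int a, int b) throws IOException {   // looks like a local call
        try (Socket socket = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(socket.getOutputStream());
             DataInputStream in = new DataInputStream(socket.getInputStream())) {
            out.writeInt(1);        // pack: procedure number for "add"
            out.writeInt(a);        // pack: first argument
            out.writeInt(b);        // pack: second argument
            out.flush();            // send the call message
            return in.readInt();    // receive and unpack the result
        }
    }
}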
9 Remote Procedure Call (RPC)
A remote procedure call is a procedure P that caller process C gets server process S to execute as if C had executed P in C's own address space.
RPCs support distributed computing at a higher level than sockets:
- architecture/OS-neutral passing of simple and complex data types
- common application needs such as name resolution, security, etc.
[Figure: the caller process calls the procedure and waits for the reply; the server process receives the request, starts procedure execution, procedure P executes, then the server sends the reply and waits for the next request; the caller resumes execution.]
10 Remote Procedure Call Messages
- An RPC involves two processes: a client process and a server process.
- The client asks to execute a remote procedure; the server executes it and returns the results.
- Two types of messages are involved in the RPC system:
  1. Call messages
  2. Reply messages
- The RPC protocol defines the format of these messages.
- The RPC protocol is independent of the transport protocol; it only deals with the specification and interpretation of messages.
Call message: it is intended for a specific remote procedure, so it must contain
1. identification information for the remote procedure,
2. the arguments necessary for its execution,
and in addition
3. a message identification field (a sequence number), useful for detecting lost and duplicate messages in case of failures,
4. a message type field (0 = call, 1 = reply),
5. a client identification number (for authentication and for the reply message).
Call message format: Message Identifier | Message Type | Client Identifier | Remote Procedure Identifier (Program Number, Version No., Procedure No.) | Arguments
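A minimal sketch of the call-message fields listed above, written as a plain Java class; the field names and types are assumptions for illustration, not a real wire format of any specific RPC library.

// Illustrative model of a call message.
public class RpcCallMessage {
    long messageId;        // sequence number: detects lost and duplicate messages
    int messageType;       // 0 = call, 1 = reply
    long clientId;         // used for authentication and to match the reply
    int programNumber;     // identifies the remote program
    int versionNumber;     // program version
    int procedureNumber;   // which procedure within the program
    byte[] arguments;      // marshalled arguments for the procedure
}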
11 RPC Messages: Reply Message
Successful reply (the specified procedure was executed successfully):
  Message Identifier | Message Type | Reply Status (Successful) | Result
Unsuccessful reply:
  Message Identifier | Message Type | Reply Status (Unsuccessful) | Result
A reply is unsuccessful when, for example:
- the call message violates the protocol,
- the client identifier is not authorized to use the service,
- the remote program version or procedure number is not available,
- the remote procedure is not able to decode the arguments, or
- an exception condition occurs while executing the remote procedure.
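And a matching sketch of the reply-message layout, again with assumed field names; treating the result field as carrying the return value on success and error information otherwise is an assumption made here for illustration.

// Illustrative model of a reply message.
public class RpcReplyMessage {
    long messageId;        // copied from the call message so the client can match it
    int messageType;       // 1 = reply
    boolean successful;    // reply status: true = successful, false = unsuccessful
    byte[] result;         // marshalled result on success, error information otherwise
}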