Map-Reduce for Parallel Computing

1 Map-Reduce for Parallel Computing
Amit Jain, Department of Computer Science, College of Engineering, Boise State University

2 Big Data, Big Disks, Cheap Computers
"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." -- Rear Admiral Grace Hopper
"More data usually beats better algorithms." -- Anand Rajaraman
"The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it." -- Tom White

3 Units and Units
Check out:

4 Introduction
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

5 Introduced by Google. Used internally for all major computations on around 1M servers. Amazon is leasing servers to run MapReduce computations (EC2 and S3 programs). Microsoft has developed Dryad (a superset of Map-Reduce). Balihoo used it for both production and R&D.

6 MapReduce Programming Model
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
MapReduce Specification Object: contains the names of the input/output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library.

7 A MapReduce Example
Consider the problem of counting the number of occurrences of each word in a large collection of documents.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterable values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(key, AsString(result));

8 MapReduce Examples
Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.
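As an illustration (not from the original slides), here is a minimal sketch of the Reverse Web-Link Graph job written against the new Hadoop API. The input format, one "source target" pair per line, and the class names are assumptions made for this sketch.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseLinkGraph {

    // Inverts each assumed "source target" line into a <target, source> pair.
    public static class LinkMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            if (itr.countTokens() == 2) {
                String source = itr.nextToken();
                String target = itr.nextToken();
                context.write(new Text(target), new Text(source));
            }
        }
    }

    // Concatenates all sources that link to a given target.
    public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sources = new StringBuilder();
            for (Text source : values) {
                if (sources.length() > 0) sources.append(", ");
                sources.append(source.toString());
            }
            context.write(key, new Text(sources.toString()));  // <target, list(source)>
        }
    }
}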

9 More MapReduce Examples
Inverted Index: The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: The map function extracts the key from each record and emits a <key, record> pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning and ordering facilities that are provided in a MapReduce implementation.
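As another illustration (again not from the slides), here is a minimal sketch of the Distributed Sort mapper and reducer, assuming tab-separated text records whose first field is the sort key; the class names are hypothetical. The global ordering comes from the framework's partitioning and sorting, not from this code.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistributedSort {

    // Extracts the sort key (the first tab-separated field) from each record.
    public static class KeyExtractMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split("\t", 2);
            context.write(new Text(fields[0]), record);  // <key, record>
        }
    }

    // Emits all pairs unchanged; they arrive already sorted by key.
    public static class IdentityReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text record : values) {
                context.write(key, record);
            }
        }
    }
}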

10 MapReduce Exercises
Case Analysis or Capitalization Probability: In a collection of text documents, find the percentage capitalization for each letter of the alphabet.

11 Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Features of Hadoop:
- Scalable: Hadoop can reliably store and process petabytes.
- Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
- Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely efficient.
- Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

12 Hadoop Implementation
Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

13 Hadoop Servers/Daemons
The Hadoop Distributed File System (HDFS) is implemented by a name node server on the master node and a data node server on each data node. The MapReduce framework is implemented by a job tracker on the master node and a task tracker on each worker node.
[Figure: the master node runs the name node and job tracker; each worker node runs a data node and a task tracker.]

14 Hadoop History
Hadoop is a sub-project of the Apache Foundation. It receives sponsorship from Google, Yahoo, Microsoft, HP and others. Hadoop is written in Java. Hadoop MapReduce programs can be written in Java as well as several other languages. Hadoop grew out of the Nutch Web Crawler project. Hadoop programs can be developed using Eclipse/NetBeans on MS Windows or Linux platforms. Using MS Windows requires the Cygwin package. MS Windows/Mac OS X are recommended for development but not as a full production Hadoop cluster. Used by Yahoo, Facebook, Amazon, RackSpace, Twitter, EBay, LinkedIn, New York Times, E-Harmony (!) and Microsoft (via acquisition of Powerset). Several cloud consulting companies, like Cloudera, have grown around Hadoop.

15 Hadoop Map-Reduce Inputs and Outputs
The Map/Reduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
The user needs to implement a Mapper class as well as a Reducer class. Optionally, the user can also write a Combiner class.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
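To make the serialization contract above concrete, here is a minimal sketch (not from the slides) of a custom key type implementing WritableComparable; the YearTempKey name and its fields are hypothetical. A plain Writable value type would implement only write and readFields.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() { }  // the framework requires a no-arg constructor

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);         // serialize fields in a fixed order
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();        // deserialize in the same order
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {  // drives the framework's sort
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}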

16 How to write a Hadoop Map class (version 1)
Subclass from MapReduceBase and implement the Mapper interface.

public class MyMapper<K extends WritableComparable, V extends Writable>
        extends MapReduceBase implements Mapper<K, V, K, V> {
    ...
}

The Mapper interface provides a single method:

public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter)

WritableComparable key:
Writable value:
OutputCollector output: this has the collect method to output a <key, value> pair
Reporter reporter: allows the application code to report progress and update status

The Hadoop system divides the input data into logical "records" and then calls map() once for each record. For text files, a record is one line of text. The key then is the byte offset and the value is a line from the text file. For other input types, it can be defined differently. The main method is responsible for setting the output key and value types.

17 How to write a Hadoop Reduce class (version 1)
Subclass from MapReduceBase and implement the Reducer interface.

public class MyReducer<K extends WritableComparable, V extends Writable>
        extends MapReduceBase implements Reducer<K, V, K, V> {
    ...
}

The Reducer interface provides a single method:

public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter)

WritableComparable key:
Iterator values:
OutputCollector output:
Reporter reporter:

Given all the values for the key, the Reduce code typically iterates over all the values and either concatenates the values together in some way to make a large summary object, or combines and reduces the values in some way to yield a short summary value.

18 WordCount example in Java with Hadoop
Problem: To count the number of occurrences of each word in a large collection of documents.

/**
 * Counts the words in each line.
 * For each line of input, break the line into words and emit them as (word, 1).
 */
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

19 WordCount Example continued

/**
 * A reducer class that just emits the sum of the input values.
 */
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

20 WordCount Example continued

public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        printUsage();
        System.exit(1);
    }
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}

21 New versus Old Hadoop API
The biggest change was a large refactoring of the core MapReduce classes. Hadoop added the classes in a new package at org.apache.hadoop.mapreduce while the old classes are in the package org.apache.hadoop.mapred.
- The new API uses abstract classes for Mapper and Reducer that the developers can then extend. This allows classes to evolve without breaking old code.
- Mapper and Reducer now have a "run" method that is called once and contains the control loop for the task, which lets applications replace it. Mapper and Reducer are, by default, the identity Mapper and Reducer.
- All of the methods take Context objects as arguments, which essentially unifies JobConf, Reporter and OutputCollector from the old API.
- The FileOutputFormats use part-r-00000 for the output of reduce 0 and part-m-00000 for the output of map 0.
- The reduce method passes values as a java.lang.Iterable rather than a java.util.Iterator, which makes it easier to use the for-each loop construct.
- The Configuration class unifies configuration. Job control is done via the Job class. The JobClient class is gone.
- However, not all libraries are available for the new API, so we sometimes have to use the old API :-(

22 Map/Reduce in new API

public class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    protected void map(KEYIN key, VALUEIN value, Mapper.Context context)
            throws IOException, InterruptedException
}

public class MyReducer extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context)
            throws IOException, InterruptedException
}

23 WordCount example with new Hadoop API
Problem: To count the number of occurrences of each word in a large collection of documents.

/**
 * Counts the words in each line.
 * For each line of input, break the line into words
 * and emit them as (word, 1).
 */
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

24 WordCount Example with new Hadoop API (contd.)

/**
 * A reducer class that just emits the sum of the input values.
 */
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

25 WordCount Example with new Hadoop API (contd.)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

26 Case Analysis Example

public class CaseAnalysis {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final static IntWritable zero = new IntWritable(0);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            for (int i = 0; i < line.length(); i++) {
                if (Character.isLowerCase(line.charAt(i))) {
                    word.set(String.valueOf(line.charAt(i)).toUpperCase());
                    context.write(word, zero);
                } else if (Character.isUpperCase(line.charAt(i))) {
                    word.set(String.valueOf(line.charAt(i)));
                    context.write(word, one);
                } else {
                    word.set("other");
                    context.write(word, one);
                }
            }
        }
    }

27 Case Analysis Example (contd.)

    public static class Reduce extends Reducer<Text, IntWritable, Text, Text> {
        private Text result = new Text();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            int upper = 0;
            for (IntWritable val : values) {
                upper += val.get();
                total++;
            }
            result.set(String.format("%16d %16d %16d %16.2f", total, upper,
                    (total - upper), ((double) upper / total)));
            context.write(key, result);
        }
    }

28 Case Analysis Example (contd.)

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: hadoop jar caseanalysis.jar <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "case analysis");
        job.setJarByClass(CaseAnalysis.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // The map output types differ from the final output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

29 Inverted Index Example
Given an input text, an inverted index program uses MapReduce to produce an index of all the words in the text. For each word, the index has a list of all the files where the word appears.

public static class InvertedIndexMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val, Context context)
            throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);
        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, location);
        }
    }
}

30 Inverted Index Example (contd.)
The reduce method is shown below.

public static class InvertedIndexReducer
        extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        for (Text value : values) {
            if (!first) toReturn.append(", ");
            first = false;
            toReturn.append(value.toString());
        }
        context.write(key, new Text(toReturn.toString()));
    }
}

31 Inverted Index Example (contd.)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
        System.out.println("Usage: InvertedIndex <input path> <output path>");
        System.exit(1);
    }
    Job job = new Job(conf, "InvertedIndex");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(InvertedIndexMapper.class);
    job.setReducerClass(InvertedIndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

32 MapReduce: High-Level Data Flow

33 MapReduce: Detailed Data Flow

34 MapReduce Optimizations
- Overlap of maps, shuffle, and sort.
- Mapper locality: schedule mappers close to the data.
- Combiner: mappers may generate duplicate keys. A side-effect-free reducer can be run on the mapper node to minimize the data size before transfer; the reducer is still run. (See the sketch after this list.)
- Speculative execution to help with load balancing: some nodes may be slower. Run a duplicate task on another node, take the first answer as correct, and abandon the other duplicate tasks. This is only done as we start getting toward the end of the tasks.
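As a sketch of how a combiner fits in (assumed code, not from the slides): for word count, a combiner can pre-sum the (word, 1) pairs on each mapper node so that far fewer pairs cross the network; the reducer then runs on the partial sums.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer whose input and output types match the map
// output types; the framework may call it zero or more times per key.
public class PartialSumCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // partial sum computed on the map side
        }
        context.write(key, new IntWritable(sum));
    }
}

It would be wired into a job with job.setCombinerClass(PartialSumCombiner.class); this is why the WordCount examples above can reuse their sum reducer as the combiner.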

35 Using a Combiner to Reduce Network Traffic

36 Setting up Hadoop
What's with all the versions?!
- 2.5.X: current stable version, 2.x release.
- 1.2.X: current stable version, 1.2 release.
- 0.23.X: similar to 2.X.X but missing NN HA (Name Node High Availability).

37 Setting up Hadoop
Download a stable version of Hadoop from the Apache Hadoop releases page. Unpack the tarball that you downloaded in the previous step:

tar xzvf hadoop-1.2.1.tar.gz

Edit conf/hadoop-env.sh and at least set JAVA_HOME to point to the Java installation folder on your system. You will need Java version 1.6 or higher. In the lab, Java is installed in /usr/java/default.
Now run bin/hadoop to test the Java setup. You should get output similar to that shown below.

[amit@kohinoor hadoop]$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
...

Make sure that you can run the ssh localhost command without a password. If you cannot, then set it up as follows:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

38 Hadoop Running Modes
We can use Hadoop in three modes:
- Standalone mode: Everything runs in a single process. Useful for debugging.
- Pseudo-distributed mode: Multiple processes as in distributed mode, but they all run on one host. Again useful for debugging the distributed mode of operation before unleashing it on a real cluster.
- Distributed mode: The real thing! Multiple processes running on multiple machines.

39 Standalone Mode
Hadoop comes ready to run in standalone mode out of the box. Try the following to test it:

mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-examples*.jar wordcount input output
cat output/*

To revert back to standalone mode, you need to edit three config files in the conf folder (core-site.xml, hdfs-site.xml and mapred-site.xml) and make them basically empty; each should contain just:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>

40 Developing Hadoop MapReduce in Eclipse
The Hadoop plugin for Eclipse isn't up to date, so it is not useful for the latest versions of Eclipse or Hadoop. Instead:
- Create a Java project in Eclipse.
- Add at least the following two jar files (found in the Hadoop download) as external jar files for the project: hadoop-core-1.2.1.jar and lib/commons-cli-1.2.jar.
- Develop the MapReduce application. Then generate a jar file using the Export menu. You can also export a build file so you can generate jar files from the command line.

41 Pseudo-Distributed Mode
To run in pseudo-distributed mode, we need to specify the following:
- The NameNode (Distributed Filesystem master) host and port. This is specified with the configuration property fs.default.name.
- The JobTracker (MapReduce master) host and port. This is specified with the configuration property mapred.job.tracker.
- The replication factor, set to 1 with the property dfs.replication. We would set this to 2 or 3 or higher on a real cluster.
- A slaves file that lists the names of all the hosts in the cluster. The default slaves file is conf/slaves; it should contain just one hostname: localhost.

42 Pseudo-Distributed Mode Config Files
To run everything in one node, edit three config files so that they contain the configuration shown below:

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

43 Pseudo-Distributed Mode
Now create a new Hadoop distributed file system (HDFS) with the command:

bin/hadoop namenode -format

Start the Hadoop daemons:

bin/start-all.sh

Create user directories for your login name:

bin/hadoop dfs -mkdir /user
bin/hadoop dfs -mkdir /user/amit

Put input files into the distributed file system:

bin/hadoop dfs -put input input

Now run the distributed job and copy the output back to the local file system:

bin/hadoop jar hadoop-*-examples.jar wordcount input output
bin/hadoop dfs -get output output

Point your web browser to localhost:50030 to watch the Hadoop job tracker and to localhost:50070 to browse the Hadoop DFS and get its status. When you are done, stop the Hadoop daemons:

bin/stop-all.sh

44 Port Forwarding to Access the Hadoop Web Interface
Use ssh port forwarding to enable access to Hadoop ports from a browser at home. Log in to onyx as follows:

ssh -Y -L 50030:onyx:50030 onyx

Then point the browser on your local system to localhost:50030 and you will have access to the Hadoop web interface without physical presence in the lab or the slow speed of running a browser remotely.

45 Fully Distributed Hadoop
Normally Hadoop runs on a dedicated cluster. In that case, the setup is a bit more complex than for the pseudo-distributed case.
- Specify the hostname or IP address of the master server in the values for fs.default.name in core-site.xml and mapred.job.tracker in mapred-site.xml. These are specified as host:port pairs. The default ports are 9000 and 9001.
- Specify directories for dfs.name.dir and dfs.data.dir, and dfs.replication, in conf/hdfs-site.xml. These are used to hold distributed file system data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
- Specify mapred.system.dir and mapred.local.dir in conf/mapred-site.xml. The system directory must be accessible by server and clients. The local directory determines where temporary MapReduce data is written. It also may be a list of directories.

46 Fully Distributed Hadoop (contd.)
- Specify mapred.map.tasks (default value: 2) and mapred.reduce.tasks (default value: 1) in conf/mapred-site.xml. The defaults are suitable for local or pseudo-distributed mode only. Choosing the right number of map and reduce tasks has significant effects on performance.
- The default Java memory size is 1000 MB. This can be changed in conf/hadoop-env.sh. This is related to the parameters discussed above. See Chapter 9 in the Hadoop book by Tom White for further discussion.
- List all slave host names or IP addresses in your conf/slaves file, one per line. List the name of the master node in conf/masters.

47 Sample Config Files
Sample core-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node00:9000</value>
    <description>The name of the default filesystem.</description>
  </property>
</configuration>

Sample hdfs-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-amit/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-amit/hdfs/data</value>
  </property>
</configuration>

48 Sample Config Files (contd.)
Sample mapred-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://node00:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop-amit/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop-amit/mapred/local</value>
  </property>
  <property>
    <name>mapred.map.tasks.maximum</name>
    <value>17</value>
    <description>
      The maximum number of map tasks which are run simultaneously on a
      given TaskTracker individually. This should be a prime number larger
      than a multiple of the number of slave hosts.
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks.maximum</name>
    <value>16</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>
</configuration>

49 Sample Config Files (contd.)
Sample hadoop-env.sh file. Only two things need to be defined here:

# Set Hadoop-specific environment variables here.
# The java implementation to use. Required.
export JAVA_HOME=/usr/java/default
...
# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000

50 Running multiple copies of Hadoop on the Onyx Cluster
Normally only one copy of Hadoop runs on a cluster. In our lab setup, we want to be able to run multiple copies of Hadoop, where the namenode (node00) is overloaded but each user has unique datanodes that they schedule via PBS. In order to do this, each user needs unique ports for the following:

property                          default value         config file
fs.default.name                   hdfs://node00:9000    core-site.xml
mapred.job.tracker                hdfs://node00:9001    mapred-site.xml
mapred.job.tracker.http.address   0.0.0.0:50030         mapred-site.xml
dfs.datanode.address              0.0.0.0:50010         hdfs-site.xml
dfs.datanode.ipc.address          0.0.0.0:50020         hdfs-site.xml
dfs.http.address                  0.0.0.0:50070         hdfs-site.xml
dfs.datanode.http.address         0.0.0.0:50075         hdfs-site.xml
dfs.secondary.http.address        0.0.0.0:50090         hdfs-site.xml

We have provided a script, cluster-pickports.sh, that lets the user pick a base port and then sets the three config files with the corresponding port numbers. So with a base port of 60000, the eight ports get set to 60000, 60001, 60010, 60020, 60030, 60070, 60075, and 60090. The script is available via the class examples code repository at: lab/hadoop/local-scripts

51 Reference: How to use Hadoop in the lab (part 1)
Download a stable version of Hadoop (version 1.2.1), or on onyx simply copy it from my folder as follows:

cd ~
mkdir ~/hadoop-install
cd hadoop-install/
cp ~amit/cs430/hadoop-1.2.1.tar.gz .
tar xzvf hadoop-1.2.1.tar.gz
ln -s hadoop-1.2.1 hadoop
cp -r ~amit/hadoop-install/local-scripts .
cp -r ~amit/hadoop-install/myexamples .
cd ~amit/hadoop-install/local-scripts
svn update

I also recommend adding the Hadoop bin folder to your path. For example, add the following to the end of your ~/.bashrc file and then log in again (or use the command source ~/.bashrc):

export PATH=~/hadoop-install/hadoop/bin:$PATH

Then set up the configuration files for the distributed mode of operation and modify the configuration files based on your group's assigned port range (check the teams page on the class website) using the following scripts:

cd ~/hadoop-install/local-scripts
setmod.sh distributed
cluster-pickports.sh <base-port>

Running Hadoop in pseudo-distributed mode in the lab isn't recommended because of potential port conflicts with the real distributed Hadoop.

52 Reference: How to use Hadoop in the lab (part 2)
Login to onyx via ssh to create the first console. Open a second window/console on onyx to grab some nodes from PBS. In the original window on onyx, create the cluster using the provided script, then check on the status of the cluster:

cd ~/hadoop-install/local-scripts
create-hadoop-cluster.sh
cd ../hadoop
...wait for a few seconds...
bin/hadoop dfsadmin -report

After you are done running jobs and you have copied the relevant output from HDFS to your local file system, destroy the Hadoop distributed file system as follows:

cd ~/hadoop-install/local-scripts
cluster-remove.sh

Finally, exit from the PBS shell in the second console to release the nodes.

53 References
- MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, Google Inc. Appeared in OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
- Hadoop: An open source implementation of MapReduce. The main website: http://hadoop.apache.org/
- Hadoop: The Definitive Guide, by Tom White. O'Reilly, June 2012.
- MapReduce tutorial at Yahoo.
- Data Clustering using MapReduce, by Makho Ngazimbi (supervised by Amit Jain). Masters in Computer Science project report, Boise State University.
- Hadoop and Hive as Scalable Alternatives to RDBMS: A Case Study, by Marissa Hollingsworth (supervised by Amit Jain). Masters in Computer Science project report, Boise State University.


More information

BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader. Arijit Das Greg Belli Erik Lowney Nick Bitto

BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader. Arijit Das Greg Belli Erik Lowney Nick Bitto BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader Arijit Das Greg Belli Erik Lowney Nick Bitto Introduction NPS Introduction Hadoop File System Background Wordcount & modifications

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

19 Putting into Practice: Large-Scale Data Management with HADOOP

19 Putting into Practice: Large-Scale Data Management with HADOOP 19 Putting into Practice: Large-Scale Data Management with HADOOP The chapter proposes an introduction to HADOOP and suggests some exercises to initiate a practical experience of the system. The following

More information

By Hrudaya nath K Cloud Computing

By Hrudaya nath K Cloud Computing Processing Big Data with Map Reduce and HDFS By Hrudaya nath K Cloud Computing Some MapReduce Terminology Job A full program - an execution of a Mapper and Reducer across a data set Task An execution of

More information

Map-Reduce and Hadoop

Map-Reduce and Hadoop Map-Reduce and Hadoop 1 Introduction to Map-Reduce 2 3 Map Reduce operations Input data are (key, value) pairs 2 operations available : map and reduce Map Takes a (key, value) and generates other (key,

More information

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has

More information

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download.

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. AWS Starting Hadoop in Distributed Mode This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. 1) Start up 3

More information

Hadoop + Clojure. Hadoop World NYC Friday, October 2, 2009. Stuart Sierra, AltLaw.org

Hadoop + Clojure. Hadoop World NYC Friday, October 2, 2009. Stuart Sierra, AltLaw.org Hadoop + Clojure Hadoop World NYC Friday, October 2, 2009 Stuart Sierra, AltLaw.org JVM Languages Functional Object Oriented Native to the JVM Clojure Scala Groovy Ported to the JVM Armed Bear CL Kawa

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G...

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G... Go Home About Contact Blog Code Publications DMOZ100k06 Photography Running Hadoop On Ubuntu Linux (Multi-Node Cluster) From Michael G. Noll Contents 1 What we want to do 2 Tutorial approach and structure

More information

Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi

Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi Building a distributed search system with Apache Hadoop and Lucene Mirko Calvaresi a Barbara, Leonardo e Vittoria 2 Index Preface... 5 1 Introduction: the Big Data Problem... 6 1.1 Big data: handling the

More information

Distributed Systems + Middleware Hadoop

Distributed Systems + Middleware Hadoop Distributed Systems + Middleware Hadoop Alessandro Sivieri Dipartimento di Elettronica, Informazione e Bioingegneria Politecnico, Italy alessandro.sivieri@polimi.it http://corsi.dei.polimi.it/distsys Contents

More information

Hadoop Overview. July 2011. Lavanya Ramakrishnan Iwona Sakrejda Shane Canon. Lawrence Berkeley National Lab

Hadoop Overview. July 2011. Lavanya Ramakrishnan Iwona Sakrejda Shane Canon. Lawrence Berkeley National Lab Hadoop Overview Lavanya Ramakrishnan Iwona Sakrejda Shane Canon Lawrence Berkeley National Lab July 2011 Overview Concepts & Background MapReduce and Hadoop Hadoop Ecosystem Tools on top of Hadoop Hadoop

More information

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction+and+Task+ Temporal databases are databases expanded with a time dimension in order to

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

University of Maryland. Tuesday, February 2, 2010

University of Maryland. Tuesday, February 2, 2010 Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information