Map-Reduce for Parallel Computing

1 Map-Reduce for Parallel Computing
Amit Jain, Department of Computer Science, College of Engineering, Boise State University

2 Big Data, Big Disks, Cheap Computers
"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." -- Rear Admiral Grace Hopper
"More data usually beats better algorithms." -- Anand Rajaraman
"The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it." -- Tom White

3 Units and Units
Check out:

4 Introduction
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

5 Introduced by Google. Used internally for all major computations on around 1M servers. Amazon is leasing servers to run MapReduce computations (EC2 and S3 programs). Microsoft has developed Dryad (a superset of Map-Reduce). Balihoo used it for both production and R&D.

6 MapReduce Programming Model
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
MapReduce Specification Object: contains the names of the input/output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library.

7 A MapReduce Example
Consider the problem of counting the number of occurrences of each word in a large collection of documents.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterable values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(key, AsString(result));

8 MapReduce Examples
Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.
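As an illustration (not from the original slides), here is a minimal sketch of the Reverse Web-Link Graph job written against the new Hadoop API. The input format, one "source target" pair per line, and the class names are assumptions made for this sketch.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseLinkGraph {

    // Inverts each assumed "source target" line into a <target, source> pair.
    public static class LinkMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            if (itr.countTokens() == 2) {
                String source = itr.nextToken();
                String target = itr.nextToken();
                context.write(new Text(target), new Text(source));
            }
        }
    }

    // Concatenates all sources that link to a given target.
    public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sources = new StringBuilder();
            for (Text source : values) {
                if (sources.length() > 0) sources.append(", ");
                sources.append(source.toString());
            }
            context.write(key, new Text(sources.toString()));  // <target, list(source)>
        }
    }
}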

9 More MapReduce Examples
Inverted Index: The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: The map function extracts the key from each record and emits a <key, record> pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning and ordering facilities that are provided in a MapReduce implementation.
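As another illustration (again not from the slides), here is a minimal sketch of the Distributed Sort mapper and reducer, assuming tab-separated text records whose first field is the sort key; the class names are hypothetical. The global ordering comes from the framework's partitioning and sorting, not from this code.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistributedSort {

    // Extracts the sort key (the first tab-separated field) from each record.
    public static class KeyExtractMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split("\t", 2);
            context.write(new Text(fields[0]), record);  // <key, record>
        }
    }

    // Emits all pairs unchanged; they arrive already sorted by key.
    public static class IdentityReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text record : values) {
                context.write(key, record);
            }
        }
    }
}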

10 MapReduce Exercises
Case Analysis or Capitalization Probability: In a collection of text documents, find the percentage capitalization for each letter of the alphabet.

11 Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Features of Hadoop:
- Scalable: Hadoop can reliably store and process petabytes.
- Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
- Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely efficient.
- Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

12 Hadoop Implementation
Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

13 Hadoop Servers/Daemons
The Hadoop Distributed File System (HDFS) is implemented by a name node server on the master node and a data node server on each data node. The MapReduce framework is implemented by a job tracker on the master node and a task tracker on each worker node.
[Figure: the master node runs the name node and job tracker; each worker node runs a data node and a task tracker.]

14 Hadoop History
Hadoop is a sub-project of the Apache Foundation. It receives sponsorship from Google, Yahoo, Microsoft, HP and others. Hadoop is written in Java. Hadoop MapReduce programs can be written in Java as well as several other languages. Hadoop grew out of the Nutch Web Crawler project. Hadoop programs can be developed using Eclipse/NetBeans on MS Windows or Linux platforms. Using MS Windows requires the Cygwin package. MS Windows/Mac OS X are recommended for development but not as a full production Hadoop cluster. Used by Yahoo, Facebook, Amazon, RackSpace, Twitter, EBay, LinkedIn, New York Times, E-Harmony (!) and Microsoft (via acquisition of Powerset). Several cloud consulting companies, like Cloudera, have grown around Hadoop.

15 Hadoop Map-Reduce Inputs and Outputs
The Map/Reduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
The user needs to implement a Mapper class as well as a Reducer class. Optionally, the user can also write a Combiner class.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
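To make the serialization contract above concrete, here is a minimal sketch (not from the slides) of a custom key type implementing WritableComparable; the YearTempKey name and its fields are hypothetical. A plain Writable value type would implement only write and readFields.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() { }  // the framework requires a no-arg constructor

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);         // serialize fields in a fixed order
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();        // deserialize in the same order
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {  // drives the framework's sort
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}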

16 How to write a Hadoop Map class (version 1)
Subclass from MapReduceBase and implement the Mapper interface.

public class MyMapper<K extends WritableComparable, V extends Writable>
        extends MapReduceBase implements Mapper<K, V, K, V> {
    ...
}

The Mapper interface provides a single method:

public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter)

WritableComparable key:
Writable value:
OutputCollector output: this has the collect method to output a <key, value> pair
Reporter reporter: allows the application code to report progress and update status

The Hadoop system divides the input data into logical "records" and then calls map() once for each record. For text files, a record is one line of text. The key then is the byte offset and the value is a line from the text file. For other input types, it can be defined differently. The main method is responsible for setting the output key and value types.

17 How to write a Hadoop Reduce class (version 1)
Subclass from MapReduceBase and implement the Reducer interface.

public class MyReducer<K extends WritableComparable, V extends Writable>
        extends MapReduceBase implements Reducer<K, V, K, V> {
    ...
}

The Reducer interface provides a single method:

public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter)

WritableComparable key:
Iterator values:
OutputCollector output:
Reporter reporter:

Given all the values for the key, the Reduce code typically iterates over all the values and either concatenates the values together in some way to make a large summary object, or combines and reduces the values in some way to yield a short summary value.

18 WordCount example in Java with Hadoop
Problem: To count the number of occurrences of each word in a large collection of documents.

/**
 * Counts the words in each line.
 * For each line of input, break the line into words and emit them as (word, 1).
 */
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

19 WordCount Example continued

/**
 * A reducer class that just emits the sum of the input values.
 */
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

20 WordCount Example continued

public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        printUsage();
        System.exit(1);
    }
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}

21 New versus Old Hadoop API
The biggest change was a large refactoring of the core MapReduce classes. Hadoop added the classes in a new package at org.apache.hadoop.mapreduce while the old classes are in the package org.apache.hadoop.mapred.
- The new API uses abstract classes for Mapper and Reducer that the developers can then extend. This allows classes to evolve without breaking old code.
- Mapper and Reducer now have a "run" method that is called once and contains the control loop for the task, which lets applications replace it. Mapper and Reducer are, by default, the identity Mapper and Reducer.
- All of the methods take Context objects as arguments, which essentially unifies JobConf, Reporter and OutputCollector from the old API.
- The FileOutputFormats use part-r-00000 for the output of reduce 0 and part-m-00000 for the output of map 0.
- The reduce method passes values as a java.lang.Iterable rather than a java.util.Iterator, which makes it easier to use the for-each loop construct.
- The Configuration class unifies configuration. Job control is done via the Job class. The JobClient class is gone.
- However, not all libraries are available for the new API, so we sometimes have to use the old API :-(

22 Map/Reduce in new API

public class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    protected void map(KEYIN key, VALUEIN value, Mapper.Context context)
            throws IOException, InterruptedException
}

public class MyReducer extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context)
            throws IOException, InterruptedException
}

23 WordCount example with new Hadoop API
Problem: To count the number of occurrences of each word in a large collection of documents.

/**
 * Counts the words in each line.
 * For each line of input, break the line into words
 * and emit them as (word, 1).
 */
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

24 WordCount Example with new Hadoop API (contd.)

/**
 * A reducer class that just emits the sum of the input values.
 */
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

25 WordCount Example with new Hadoop API (contd.)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

26 Case Analysis Example

public class CaseAnalysis {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final static IntWritable zero = new IntWritable(0);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            for (int i = 0; i < line.length(); i++) {
                if (Character.isLowerCase(line.charAt(i))) {
                    word.set(String.valueOf(line.charAt(i)).toUpperCase());
                    context.write(word, zero);
                } else if (Character.isUpperCase(line.charAt(i))) {
                    word.set(String.valueOf(line.charAt(i)));
                    context.write(word, one);
                } else {
                    word.set("other");
                    context.write(word, one);
                }
            }
        }
    }

27 Case Analysis Example (contd.)

    public static class Reduce extends Reducer<Text, IntWritable, Text, Text> {
        private Text result = new Text();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            int upper = 0;
            for (IntWritable val : values) {
                upper += val.get();
                total++;
            }
            result.set(String.format("%16d %16d %16d %16.2f", total, upper,
                    (total - upper), ((double) upper / total)));
            context.write(key, result);
        }
    }

28 Case Analysis Example (contd.)

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: hadoop jar caseanalysis.jar <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "case analysis");
        job.setJarByClass(CaseAnalysis.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // The map output types differ from the final output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

29 Inverted Index Example
Given an input text, an inverted index program uses MapReduce to produce an index of all the words in the text. For each word, the index has a list of all the files where the word appears.

public static class InvertedIndexMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val, Context context)
            throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);
        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, location);
        }
    }
}

30 Inverted Index Example (contd.)
The reduce method is shown below.

public static class InvertedIndexReducer
        extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        for (Text value : values) {
            if (!first) toReturn.append(", ");
            first = false;
            toReturn.append(value.toString());
        }
        context.write(key, new Text(toReturn.toString()));
    }
}

31 Inverted Index Example (contd.)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
        System.out.println("Usage: InvertedIndex <input path> <output path>");
        System.exit(1);
    }
    Job job = new Job(conf, "InvertedIndex");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(InvertedIndexMapper.class);
    job.setReducerClass(InvertedIndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

32 MapReduce: High-Level Data Flow

33 MapReduce: Detailed Data Flow

34 MapReduce Optimizations
- Overlap of maps, shuffle, and sort.
- Mapper locality: schedule mappers close to the data.
- Combiner: mappers may generate duplicate keys. A side-effect-free reducer can be run on the mapper node to minimize the data size before transfer; the reducer is still run. (See the sketch after this list.)
- Speculative execution to help with load balancing: some nodes may be slower. Run a duplicate task on another node, take the first answer as correct, and abandon the other duplicate tasks. This is only done as we start getting toward the end of the tasks.
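As a sketch of how a combiner fits in (assumed code, not from the slides): for word count, a combiner can pre-sum the (word, 1) pairs on each mapper node so that far fewer pairs cross the network; the reducer then runs on the partial sums.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer whose input and output types match the map
// output types; the framework may call it zero or more times per key.
public class PartialSumCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // partial sum computed on the map side
        }
        context.write(key, new IntWritable(sum));
    }
}

It would be wired into a job with job.setCombinerClass(PartialSumCombiner.class); this is why the WordCount examples above can reuse their sum reducer as the combiner.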

35 Using a Combiner to Reduce Network Traffic

36 Setting up Hadoop
What's with all the versions?!
- 2.5.X: current stable version, 2.x release.
- 1.2.X: current stable version, 1.2 release.
- 0.23.X: similar to 2.X.X but missing NN HA (Name Node High Availability).

37 Setting up Hadoop
Download a stable version of Hadoop from the Apache Hadoop releases page. Unpack the tarball that you downloaded in the previous step:

tar xzvf hadoop-1.2.1.tar.gz

Edit conf/hadoop-env.sh and at least set JAVA_HOME to point to the Java installation folder on your system. You will need Java version 1.6 or higher. In the lab, Java is installed in /usr/java/default.
Now run bin/hadoop to test the Java setup. You should get output similar to that shown below.

[amit@kohinoor hadoop]$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
...

Make sure that you can run the ssh localhost command without a password. If you cannot, then set it up as follows:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

38 Hadoop Running Modes
We can use Hadoop in three modes:
- Standalone mode: Everything runs in a single process. Useful for debugging.
- Pseudo-distributed mode: Multiple processes as in distributed mode, but they all run on one host. Again useful for debugging the distributed mode of operation before unleashing it on a real cluster.
- Distributed mode: The real thing! Multiple processes running on multiple machines.

39 Standalone Mode
Hadoop comes ready to run in standalone mode out of the box. Try the following to test it:

mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-examples*.jar wordcount input output
cat output/*

To revert back to standalone mode, you need to edit three config files in the conf folder (core-site.xml, hdfs-site.xml and mapred-site.xml) and make them basically empty; each should contain just:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>

40 Developing Hadoop MapReduce in Eclipse
The Hadoop plugin for Eclipse isn't up to date, so it is not useful for the latest versions of Eclipse or Hadoop. Instead:
- Create a Java project in Eclipse.
- Add at least the following two jar files (found in the Hadoop download) as external jar files for the project: hadoop-core-1.2.1.jar and lib/commons-cli-1.2.jar.
- Develop the MapReduce application. Then generate a jar file using the Export menu. You can also export a build file so you can generate jar files from the command line.

41 Pseudo-Distributed Mode
To run in pseudo-distributed mode, we need to specify the following:
- The NameNode (Distributed Filesystem master) host and port. This is specified with the configuration property fs.default.name.
- The JobTracker (MapReduce master) host and port. This is specified with the configuration property mapred.job.tracker.
- The replication factor, set to 1 with the property dfs.replication. We would set this to 2 or 3 or higher on a real cluster.
- A slaves file that lists the names of all the hosts in the cluster. The default slaves file is conf/slaves; it should contain just one hostname: localhost.

42 Pseudo-Distributed Mode Config Files
To run everything in one node, edit three config files so that they contain the configuration shown below:

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

43 Pseudo-Distributed Mode
Now create a new Hadoop distributed file system (HDFS) with the command:

bin/hadoop namenode -format

Start the Hadoop daemons:

bin/start-all.sh

Create user directories for your login name:

bin/hadoop dfs -mkdir /user
bin/hadoop dfs -mkdir /user/amit

Put input files into the distributed file system:

bin/hadoop dfs -put input input

Now run the distributed job and copy the output back to the local file system:

bin/hadoop jar hadoop-*-examples.jar wordcount input output
bin/hadoop dfs -get output output

Point your web browser to localhost:50030 to watch the Hadoop job tracker and to localhost:50070 to browse the Hadoop DFS and get its status. When you are done, stop the Hadoop daemons:

bin/stop-all.sh

44 Port Forwarding to Access the Hadoop Web Interface
Use ssh port forwarding to enable access to Hadoop ports from a browser at home. Log in to onyx as follows:

ssh -Y -L 50030:onyx:50030 onyx

Then point the browser on your local system to localhost:50030 and you will have access to the Hadoop web interface without physical presence in the lab or the slow speed of running a browser remotely.

45 Fully Distributed Hadoop
Normally Hadoop runs on a dedicated cluster. In that case, the setup is a bit more complex than for the pseudo-distributed case.
- Specify the hostname or IP address of the master server in the values for fs.default.name in core-site.xml and mapred.job.tracker in mapred-site.xml. These are specified as host:port pairs. The default ports are 9000 and 9001.
- Specify directories for dfs.name.dir and dfs.data.dir, and dfs.replication, in conf/hdfs-site.xml. These are used to hold distributed file system data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
- Specify mapred.system.dir and mapred.local.dir in conf/mapred-site.xml. The system directory must be accessible by server and clients. The local directory determines where temporary MapReduce data is written. It also may be a list of directories.

46 Fully Distributed Hadoop (contd.)
- Specify mapred.map.tasks (default value: 2) and mapred.reduce.tasks (default value: 1) in conf/mapred-site.xml. The defaults are suitable for local or pseudo-distributed mode only. Choosing the right number of map and reduce tasks has significant effects on performance.
- The default Java memory size is 1000 MB. This can be changed in conf/hadoop-env.sh. This is related to the parameters discussed above. See Chapter 9 in the Hadoop book by Tom White for further discussion.
- List all slave host names or IP addresses in your conf/slaves file, one per line. List the name of the master node in conf/masters.

47 Sample Config Files
Sample core-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node00:9000</value>
    <description>The name of the default filesystem.</description>
  </property>
</configuration>

Sample hdfs-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-amit/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-amit/hdfs/data</value>
  </property>
</configuration>

48 Sample Config Files (contd.)
Sample mapred-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://node00:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop-amit/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop-amit/mapred/local</value>
  </property>
  <property>
    <name>mapred.map.tasks.maximum</name>
    <value>17</value>
    <description>
      The maximum number of map tasks which are run simultaneously on a
      given TaskTracker individually. This should be a prime number larger
      than a multiple of the number of slave hosts.
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks.maximum</name>
    <value>16</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>
</configuration>

49 Sample Config Files (contd.)
Sample hadoop-env.sh file. Only two things need to be defined here:

# Set Hadoop-specific environment variables here.
# The java implementation to use. Required.
export JAVA_HOME=/usr/java/default
...
# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000

50 Running multiple copies of Hadoop on the Onyx Cluster
Normally only one copy of Hadoop runs on a cluster. In our lab setup, we want to be able to run multiple copies of Hadoop, where the namenode (node00) is overloaded but each user has unique datanodes that they schedule via PBS. In order to do this, each user needs unique ports for the following:

property                          default value         config file
fs.default.name                   hdfs://node00:9000    core-site.xml
mapred.job.tracker                hdfs://node00:9001    mapred-site.xml
mapred.job.tracker.http.address   0.0.0.0:50030         mapred-site.xml
dfs.datanode.address              0.0.0.0:50010         hdfs-site.xml
dfs.datanode.ipc.address          0.0.0.0:50020         hdfs-site.xml
dfs.http.address                  0.0.0.0:50070         hdfs-site.xml
dfs.datanode.http.address         0.0.0.0:50075         hdfs-site.xml
dfs.secondary.http.address        0.0.0.0:50090         hdfs-site.xml

We have provided a script, cluster-pickports.sh, that lets the user pick a base port and then sets the three config files with the corresponding port numbers. So with a base port of 60000, the eight ports get set to 60000, 60001, 60010, 60020, 60030, 60070, 60075, and 60090. The script is available via the class examples code repository at: lab/hadoop/local-scripts

51 Reference: How to use Hadoop in the lab (part 1)
Download a stable version of Hadoop (version 1.2.1), or on onyx simply copy it from my folder as follows:

cd ~
mkdir ~/hadoop-install
cd hadoop-install/
cp ~amit/cs430/hadoop-1.2.1.tar.gz .
tar xzvf hadoop-1.2.1.tar.gz
ln -s hadoop-1.2.1 hadoop
cp -r ~amit/hadoop-install/local-scripts .
cp -r ~amit/hadoop-install/myexamples .
cd ~amit/hadoop-install/local-scripts
svn update

I also recommend adding the Hadoop bin folder to your path. For example, add the following to the end of your ~/.bashrc file and then log in again (or use the command source ~/.bashrc):

export PATH=~/hadoop-install/hadoop/bin:$PATH

Then set up the configuration files for the distributed mode of operation and modify the configuration files based on your group's assigned port range (check the teams page on the class website) using the following scripts:

cd ~/hadoop-install/local-scripts
setmod.sh distributed
cluster-pickports.sh <base-port>

Running Hadoop in pseudo-distributed mode in the lab isn't recommended because of potential port conflicts with the real distributed Hadoop.

52 Reference: How to use Hadoop in the lab (part 2)
Login to onyx via ssh to create the first console. Open a second window/console on onyx to grab some nodes from PBS. In the original window on onyx, create the cluster using the provided script, then check on the status of the cluster:

cd ~/hadoop-install/local-scripts
create-hadoop-cluster.sh
cd ../hadoop
...wait for a few seconds...
bin/hadoop dfsadmin -report

After you are done running jobs and you have copied the relevant output from HDFS to your local file system, destroy the Hadoop distributed file system as follows:

cd ~/hadoop-install/local-scripts
cluster-remove.sh

Finally, exit from the PBS shell in the second console to release the nodes.

53 References
- MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, Google Inc. Appeared in OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
- Hadoop: An open source implementation of MapReduce. The main website: http://hadoop.apache.org/
- Hadoop: The Definitive Guide, by Tom White. O'Reilly, June 2012.
- MapReduce tutorial at Yahoo.
- Data Clustering using MapReduce, by Makho Ngazimbi (supervised by Amit Jain). Masters in Computer Science project report, Boise State University.
- Hadoop and Hive as Scalable Alternatives to RDBMS: A Case Study, by Marissa Hollingsworth (supervised by Amit Jain). Masters in Computer Science project report, Boise State University.


More information

BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader. Arijit Das Greg Belli Erik Lowney Nick Bitto

BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader. Arijit Das Greg Belli Erik Lowney Nick Bitto BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader Arijit Das Greg Belli Erik Lowney Nick Bitto Introduction NPS Introduction Hadoop File System Background Wordcount & modifications

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

19 Putting into Practice: Large-Scale Data Management with HADOOP

19 Putting into Practice: Large-Scale Data Management with HADOOP 19 Putting into Practice: Large-Scale Data Management with HADOOP The chapter proposes an introduction to HADOOP and suggests some exercises to initiate a practical experience of the system. The following

More information

By Hrudaya nath K Cloud Computing

By Hrudaya nath K Cloud Computing Processing Big Data with Map Reduce and HDFS By Hrudaya nath K Cloud Computing Some MapReduce Terminology Job A full program - an execution of a Mapper and Reducer across a data set Task An execution of

More information

Map-Reduce and Hadoop

Map-Reduce and Hadoop Map-Reduce and Hadoop 1 Introduction to Map-Reduce 2 3 Map Reduce operations Input data are (key, value) pairs 2 operations available : map and reduce Map Takes a (key, value) and generates other (key,

More information

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has

More information

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download.

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. AWS Starting Hadoop in Distributed Mode This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. 1) Start up 3

More information

Hadoop + Clojure. Hadoop World NYC Friday, October 2, 2009. Stuart Sierra, AltLaw.org

Hadoop + Clojure. Hadoop World NYC Friday, October 2, 2009. Stuart Sierra, AltLaw.org Hadoop + Clojure Hadoop World NYC Friday, October 2, 2009 Stuart Sierra, AltLaw.org JVM Languages Functional Object Oriented Native to the JVM Clojure Scala Groovy Ported to the JVM Armed Bear CL Kawa

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G...

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G... Go Home About Contact Blog Code Publications DMOZ100k06 Photography Running Hadoop On Ubuntu Linux (Multi-Node Cluster) From Michael G. Noll Contents 1 What we want to do 2 Tutorial approach and structure

More information

Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi

Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi Building a distributed search system with Apache Hadoop and Lucene Mirko Calvaresi a Barbara, Leonardo e Vittoria 2 Index Preface... 5 1 Introduction: the Big Data Problem... 6 1.1 Big data: handling the

More information

Distributed Systems + Middleware Hadoop

Distributed Systems + Middleware Hadoop Distributed Systems + Middleware Hadoop Alessandro Sivieri Dipartimento di Elettronica, Informazione e Bioingegneria Politecnico, Italy alessandro.sivieri@polimi.it http://corsi.dei.polimi.it/distsys Contents

More information

Hadoop Overview. July 2011. Lavanya Ramakrishnan Iwona Sakrejda Shane Canon. Lawrence Berkeley National Lab

Hadoop Overview. July 2011. Lavanya Ramakrishnan Iwona Sakrejda Shane Canon. Lawrence Berkeley National Lab Hadoop Overview Lavanya Ramakrishnan Iwona Sakrejda Shane Canon Lawrence Berkeley National Lab July 2011 Overview Concepts & Background MapReduce and Hadoop Hadoop Ecosystem Tools on top of Hadoop Hadoop

More information

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction+and+Task+ Temporal databases are databases expanded with a time dimension in order to

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

University of Maryland. Tuesday, February 2, 2010

University of Maryland. Tuesday, February 2, 2010 Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information