Running Hadoop on Grid'5000

Vinicius Cogo, vielmo@lasige.di.fc.ul.pt
Marcelo Pasin, pasin@di.fc.ul.pt
Andrea Charão, andrea@inf.ufsm.

Outline

1 Introduction
2 MapReduce
3 Hadoop
4 How to Install Hadoop?
5 How to Configure Hadoop?
6 How to Start Hadoop?
7 How to Run Hadoop Applications?
8 How to Use Hadoop Environment on Grid'5000?
9 How to Develop Hadoop Applications?
10 Read More

1 - Introduction

- Main goal: introduce the development of MapReduce applications using the Hadoop framework.
- Important: prepare a Hadoop environment.
- A ready-made Hadoop environment is available on Grid'5000.

2 - MapReduce

- A programming model proposed by Google.
- Based on the map and reduce functions of LISP.
- Exploits parallelism by distributing the data among the nodes, rather than by parallelizing the processing itself.

2 - MapReduce

MapReduce data flow example: input -> map -> shuffle -> reduce -> output

3 - Hadoop

- A set of sub-projects: Pig, Chukwa, Hive, HBase, MapReduce, HDFS, ZooKeeper, Core, Avro.
- Yahoo!'s MapReduce implementation.
- Free and open-source framework.

3 - Hadoop

- Split = a piece of the input, e.g. "Lorem ipsum dolor sit amet" or "Lisbon 20"
- Information = <key, value> pairs, e.g. <0, Lorem ipsum dolor sit amet> or <Lisbon, 20>
- Task = a part of the work (map task or reduce task)
- Job = the entire work
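As an illustration of how a split could turn into <key, value> pairs, here is a minimal sketch in plain Java (no Hadoop classes). It assumes two conventions that the slide only hints at: for line-oriented text the key is the byte offset at which the line starts, and for key/value text the key is whatever precedes the first separator. The class name `SplitRecords`, its methods, and the choice of a space as separator are hypothetical, not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitRecords {
    // Formats one <key, value> pair as text, in the notation used on the slide.
    static String pair(Object key, String value) {
        return "<" + key + ", " + value + ">";
    }

    // Line-oriented text (assumption): key = byte offset of the line start, value = the line.
    static List<String> offsetPairs(String split) {
        List<String> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : split.split("\n")) {
            pairs.add(pair(offset, line));
            // +1 accounts for the newline character that ended this line.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return pairs;
    }

    // Key/value text (assumption): key = text before the first separator, value = the rest.
    static String keyValuePair(String line, char separator) {
        int cut = line.indexOf(separator);
        if (cut < 0) {
            return pair(line, "");
        }
        return pair(line.substring(0, cut), line.substring(cut + 1));
    }

    public static void main(String[] args) {
        System.out.println(offsetPairs("Lorem ipsum dolor sit amet\nLisbon 20"));
        System.out.println(keyValuePair("Lisbon 20", ' '));
    }
}
```

The first call reproduces the <0, Lorem ipsum dolor sit amet> pair from the slide, and the second reproduces <Lisbon, 20>.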

3 - Hadoop

Input -> Map -> Shuffle -> Reduce -> Output

3 - Hadoop

Word count example (Input -> Map -> Shuffle -> Reduce -> Output):

Input:
    split 1: Madrid Lisbon
    split 2: Lisbon Paris
    split 3: Paris London
Map:
    split 1: Lisbon,1  Madrid,1
    split 2: Lisbon,1  Paris,1
    split 3: London,1  Paris,1
Shuffle:
    reducer 1: Lisbon,1  Lisbon,1  London,1  ->  Lisbon, <1, 1>  London, <1>
    reducer 2: Madrid,1  Paris,1  Paris,1  ->  Madrid, <1>  Paris, <1, 1>
Reduce and Output:
    reducer 1: Lisbon, 2  London, 1
    reducer 2: Madrid, 1  Paris, 2
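The data flow of this example can be imitated in a few lines of plain Java, which may help before moving to the real framework. This is only an in-memory sketch of the three phases, not Hadoop code: the class `WordCountFlow` and its methods are hypothetical names, everything runs in one process, and a single sorted map stands in for the two reducers of the slide.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountFlow {
    // Map phase: emit a (word, 1) pair for every word of one split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, e.g. Lisbon -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    // Runs the three phases in sequence over all splits.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split));
        }
        return reduce(shuffle(emitted));
    }

    public static void main(String[] args) {
        // Prints {Lisbon=2, London=1, Madrid=1, Paris=2}
        System.out.println(run(Arrays.asList("Madrid Lisbon", "Lisbon Paris", "Paris London")));
    }
}
```

In Hadoop the framework performs the shuffle itself and distributes the map and reduce calls over the TaskTrackers; the programmer writes only the bodies of map and reduce.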

3 - Hadoop

[Figure: Hadoop architecture showing a Client, a JobTracker, HDFS, and three TaskTrackers]

4 How to Install Hadoop?

- Install Java 1.6.XX.
- Configure SSH to use RSA or DSA key-based authentication.
- Download a Hadoop version.
- Unzip the files into some folder, e.g. $PATH = /opt/hadoop/ (in these slides, $PATH denotes the Hadoop installation folder, not the shell's PATH variable).
- Set the JAVA_HOME property in the hadoop-env.sh file, located in the $PATH/conf/ folder.
  From:
      # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
  To:
      export JAVA_HOME=/usr/lib/jvm/java-6-sun

5 How to Configure Hadoop?

Configuration files in $PATH/conf/: masters, slaves, core-site.xml, mapred-site.xml, hdfs-site.xml

5 How to Configure Hadoop?

masters:
    node01.site.grid5000.fr

5 How to Configure Hadoop?

slaves:
    node01.site.grid5000.fr
    node02.site.grid5000.fr
    node03.site.grid5000.fr
    ...
    nodeNN.site.grid5000.fr
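The masters and slaves files are plain lists of host names, so they can be derived from the list of reserved nodes. The sketch below follows the layout shown on the slides (the first node is the master, and every node, the master included, is a slave). It assumes that the node list may name the same host several times, so duplicates are dropped while keeping the original order. The class `NodeFiles` and its methods are hypothetical helpers, not part of Hadoop or OAR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NodeFiles {
    // Returns the distinct host names in first-seen order, ignoring blank lines.
    static List<String> uniqueNodes(List<String> nodeFileLines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : nodeFileLines) {
            String host = line.trim();
            if (!host.isEmpty()) {
                seen.add(host);
            }
        }
        return new ArrayList<>(seen);
    }

    // Content of the masters file: the first node only.
    static String mastersFile(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes were reserved");
        }
        return nodes.get(0) + "\n";
    }

    // Content of the slaves file: every node, one per line (master included).
    static String slavesFile(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes were reserved");
        }
        return String.join("\n", nodes) + "\n";
    }

    public static void main(String[] args) {
        List<String> nodes = uniqueNodes(Arrays.asList(
                "node01.site.grid5000.fr",
                "node01.site.grid5000.fr",
                "node02.site.grid5000.fr"));
        System.out.print("masters:\n" + mastersFile(nodes));
        System.out.print("slaves:\n" + slavesFile(nodes));
    }
}
```

Writing the two strings to $PATH/conf/masters and $PATH/conf/slaves is left out, since where the files live depends on the installation.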

5 How to Configure Hadoop?

core-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://node01.site.grid5000.fr:54310</value>
      </property>
    </configuration>

5 How to Configure Hadoop?

mapred-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>hdfs://node01.site.grid5000.fr:54311</value>
      </property>
    </configuration>

5 How to Configure Hadoop?

hdfs-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
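Since the three site files share the same <configuration>/<property> layout and differ only in the master host name and a few values, they could be generated rather than edited by hand. The following is a minimal sketch under that assumption: the class `SiteXml` and its methods are hypothetical, while the property names, ports, and values are the ones shown on the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SiteXml {
    // Escapes the characters that are not allowed verbatim in XML text.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Renders name/value pairs in the <configuration> layout used by Hadoop site files.
    static String render(Map<String, String> properties) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>\n<configuration>\n");
        for (Map.Entry<String, String> property : properties.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(escape(property.getKey())).append("</name>\n")
               .append("    <value>").append(escape(property.getValue())).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    // core-site.xml: temporary folder and HDFS address (port 54310, as on the slide).
    static String coreSite(String master) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("hadoop.tmp.dir", "/tmp/hadoop-${user.name}");
        properties.put("fs.default.name", "hdfs://" + master + ":54310");
        return render(properties);
    }

    // mapred-site.xml: JobTracker address (port 54311, as on the slide).
    static String mapredSite(String master) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("mapred.job.tracker", "hdfs://" + master + ":54311");
        return render(properties);
    }

    // hdfs-site.xml: number of replicas kept for each HDFS block.
    static String hdfsSite(int replication) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("dfs.replication", Integer.toString(replication));
        return render(properties);
    }

    public static void main(String[] args) {
        String master = "node01.site.grid5000.fr";
        System.out.println(coreSite(master));
        System.out.println(mapredSite(master));
        System.out.println(hdfsSite(1));
    }
}
```

Keeping the master host name as the single parameter means the same code works for any reservation, whichever node ends up first in the list.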

6 How to Start Hadoop?

- Connect to the master node via ssh.
- Stop all the current HDFS and Hadoop MapReduce instances:
      $PATH/bin/stop-all.sh
- Format the HDFS namenode:
      $PATH/bin/hadoop namenode -format
- Initialize HDFS:
      $PATH/bin/start-dfs.sh
- Initialize Hadoop MapReduce:
      $PATH/bin/start-mapred.sh

7 How to Run Hadoop Applications?

Example of a generic call:
    $PATH/bin/hadoop jar file.jar [<parameters>]

Example of a real call (the pi example):
    $PATH/bin/hadoop jar \
        $PATH/hadoop-0.20.*-examples.jar \
        pi <nMaps> <nSamples>

7 How to Run Hadoop Applications?

Some important HDFS commands:
    $PATH/bin/hadoop dfs -ls [foldername]
    $PATH/bin/hadoop dfs -rm [filename]
    $PATH/bin/hadoop dfs -mkdir [foldername]
    $PATH/bin/hadoop dfs -copyFromLocal [filelocal] [filehdfs]
    $PATH/bin/hadoop dfs -copyToLocal [filehdfs] [filelocal]

8 How to Use Hadoop Environment on Grid'5000?

- Allocate, with OAR, the number of nodes you will need for the Hadoop job:
      oarsub -I -t deploy -l nodes=num_hosts,walltime=hh:mm:ss
- Deploy the Hadoop environment and run the configuration script on the nodes allocated for the job:
      kadeploy3 \
          -a ~vvielmocogo/hadoop/0.20.1/lenny-x64-nfs-hadoop.dsc3 \
          -f $OAR_FILE_NODES \
          -k ~/.ssh/id_rsa.pub \
          -s ~vvielmocogo/hadoop/0.20.1/config.sh

9 How to Develop Hadoop Applications?

Examples are in the folder $PATH/src/examples/org/apache/hadoop/examples/

What do you need to program?

    // KeyType and ValueType stand for the key and value classes of your job.
    public void map(KeyType key, ValueType value, Context context)
            throws IOException, InterruptedException {
        // map code
    }

    // reduce receives all the values that were emitted for one key.
    public void reduce(KeyType key, Iterable<ValueType> values, Context context)
            throws IOException, InterruptedException {
        // reduce code
    }

9 How to Develop Hadoop Applications?

WordCount.java

    // Mapper: emits (word, 1) for every token of the input line.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        IntWritable one = new IntWritable(1);
        Text word = new Text();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    // Reducer: sums the counts collected for one word.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        IntWritable result = new IntWritable();
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }

9 How to Develop Hadoop Applications?

Preparation: copy the input files to HDFS:
    $PATH/bin/hadoop dfs -copyFromLocal \
        ~vvielmocogo/hadoop0.20.1/gutenberg/ \
        gutenberg

Exercise 1: For each word in the input, return the number of the line in which the word occurs most often.

Exercise 2: For each word in the input, return the list of lines that contain the word, just like an index.

P.S.1: To add a new exercise to the hadoop-examples JAR, you have to create a new Java file in the examples folder and add its class to the ExampleDriver.java file.

P.S.2: To generate the hadoop-examples JAR, use:
    ant -Doffline=true examples

10 Read More

Extras

Grid'5000 Hadoop Environment:
- Deploy lenny-x64-nfs
- Extract the Hadoop files and configure $JAVA_HOME
- Create a new environment (tgz-g5k)
- Create the descriptor file
- Create the script to configure the environment:
    - Fills the masters and slaves files based on $OAR_FILE_NODES
    - Fills the other configuration XML files
    - Copies the configuration files to all nodes
    - Starts Hadoop on the master node
