Easily parallelize existing applications with the Hadoop framework
Juan Lago, July 2011

There are three ways of installing Hadoop:

Standalone (or local) mode: no daemons running. Nothing to configure after downloading and installing Hadoop.
Pseudo-distributed mode: daemons run on the local machine, thus simulating a cluster on a small scale.
Fully distributed mode: in a real cluster.

The easiest way to install and try Hadoop is on a single machine. In this guide, Hadoop is installed on a BioNode (http://www.wurlug.org/wurlug/index.php/bionode) image under VirtualBox in pseudo-distributed mode. BioNode is an open source software initiative to provide a scalable Linux virtual machine image for bioinformatics, based on Debian Linux and Bio Med packages. The image can be deployed on the desktop using VirtualBox, on a number of networked PCs, and in the cloud (currently Amazon EC2 is supported).

Prerequisites

Hadoop's supported platforms are GNU/Linux as a development and production platform; Win32 is supported only as a development platform. For the purpose of this document, GNU/Linux (the BioNode image) is assumed. Java 1.6.x, preferably from Sun, must be installed. Also, ssh must be installed and sshd must be running. You will need a legacy application, such as a C program or even a bash script (or a Perl, Ruby or Python script), to run under Hadoop. Finally, some bash scripting background is assumed.

Java JDK 6 from Sun

Install the sun-java6-jdk package (and its dependencies) from Synaptic or from the command line:

$ sudo apt-get install sun-java6-jdk

Select Sun's Java as the default on your machine:

$ sudo update-java-alternatives -s java-6-sun

To test whether Sun's JDK is correctly set up:
$ java -version

Creating a hadoop user

1) Create a hadoop user as root.

$ su -
Password: *******
# useradd -U -m -s /bin/bash hadoop
# passwd hadoop
Enter new UNIX password: ******
Retype new UNIX password: ******

2) Before you can run Hadoop, you need to tell it where Java is located on your system. Check for the JAVA_HOME environment variable and, if it is undefined, set it in the .bashrc file of the hadoop user.

# su - hadoop
$ pwd
/home/hadoop
$ vim .bashrc

At the end of the file, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

Configuring SSH

Hadoop requires SSH access to manage its nodes, even if you want to use Hadoop only on your local machine. In our case, it is necessary to configure SSH access to localhost so that the hadoop user can log in without a password. As the hadoop user, do the following:

$ ssh-keygen -t rsa -P ""
$ cd .ssh
$ cat id_rsa.pub >> authorized_keys

Download and install Hadoop software

1) Download Hadoop from a mirror site, e.g. http://apache.rediris.es/hadoop/core/. Get into the stable directory. At the time of writing this guide, the current stable release is hadoop-0.20.2.

2) Untar it in /home/hadoop (or in your favorite path).
$ tar xvfz hadoop-0.20.2.tar.gz

3) Set the HADOOP_HOME environment variable and add the Hadoop bin directory to the PATH.

$ vim .bashrc

export HADOOP_HOME=/home/hadoop/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin

4) Edit the file $HADOOP_HOME/conf/hadoop-env.sh and set the correct JAVA_HOME path there as well. Following this guide, it would be:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

5) Test the Hadoop installation.

$ hadoop version
Hadoop 0.20.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010

Disabling IPv6

In some cases (Debian-based Linux distributions like Ubuntu), using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to IPv6 addresses. In this single-node installation, it is recommended to disable IPv6, which can be done by editing $HADOOP_HOME/conf/hadoop-env.sh again and adding:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Another, more persistent way of doing this is to edit /etc/sysctl.conf and add the following lines at the end of the file:

$ sudo vim /etc/sysctl.conf

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

It is necessary to reboot the machine for the changes to take effect. Yet another way to disable IPv6 is to blacklist the module in the /etc/modprobe.d/blacklist file:

# disable IPv6
blacklist ipv6

Configuration files

For a single-node cluster setup, the following files must be configured:

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop_tmp</value>
    <description>A base directory for other temporary files.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The uri's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the host,
    port, etc. for a FileSystem.</description>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default (3) is used if
    replication is not specified at create time.</description>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx500m</value>
  </property>
</configuration>

Note that both the masters and slaves files in $HADOOP_HOME/conf contain localhost.

Formatting the name node

The first step in starting up the Hadoop installation is formatting the Hadoop Distributed Filesystem (HDFS), which is implemented on top of the local filesystem of our single-node cluster:

$ hadoop namenode -format
11/07/23 07:31:26 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = bionode32-0/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
11/07/23 07:31:26 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
11/07/23 07:31:26 INFO namenode.FSNamesystem: supergroup=supergroup
11/07/23 07:31:26 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/07/23 07:31:27 INFO common.Storage: Image file of size 96 saved in 0 seconds.
11/07/23 07:31:27 INFO common.Storage: Storage directory /tmp/hadoop_tmp/dfs/name has been successfully formatted.
11/07/23 07:31:27 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bionode32-0/127.0.0.1
************************************************************/

Starting daemons

To start this single-node cluster, run the command:

$ start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-namenode-bionode32-0.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-datanode-bionode32-0.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-secondarynamenode-bionode32-0.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-jobtracker-bionode32-0.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-tasktracker-bionode32-0.out

We can check whether the expected Hadoop processes are running with jps (part of Sun's Java since v1.5.0):

$ jps
3387 NameNode
3581 SecondaryNameNode
3742 TaskTracker
3804 Jps
3655 JobTracker
3485 DataNode

Hadoop comes with several web interfaces, which are by default available at these locations:

http://localhost:50030/ - web UI for the MapReduce job tracker(s)
http://localhost:50060/ - web UI for the task tracker(s)
http://localhost:50070/ - web UI for the HDFS name node(s)

If the NameNode doesn't start, check the logs directory ($HADOOP_HOME/logs). If you ever reach an inconsistent state of HDFS while testing Hadoop, stop the daemons, remove /tmp/hadoop_tmp, create it again and restart the daemons. Hadoop has a filesystem checking utility:

$ hadoop fsck / -files -blocks

For more help, just type:
$ hadoop fsck

Running examples

$ cd
$ cd hadoop-0.20.2
$ hadoop jar hadoop-0.20.2-examples.jar pi 20 1000
$ hadoop jar hadoop-0.20.2-examples.jar wordcount /input /output

Stopping daemons

$ stop-all.sh
Hadoop Streaming

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to develop your MapReduce program.

Let's illustrate this by executing a MapReduce program written in bash. For simplicity, we are not interested now in the reduce phase; let's concentrate only on the map phase. First, I'll create my own map program called mymap_hdfs.sh.

#!/usr/bin/env bash
# mymap_hdfs.sh: example MapReduce program written in bash using HDFS.
# Adds numbers contained in text files.
# Written by Juan Lago

# NLineInputFormat gives a single line: key is offset, value is file to process
read offset file

local_file=`basename $file`
log_file=$local_file.log
output_file=$local_file.out

touch $log_file
echo "Reading ($offset) and ($file)" >> $log_file

# Retrieve file from storage to local disk
echo "Retrieving $file from HDFS" >> $log_file
echo "$HADOOP_HOME/bin/hadoop fs -copyToLocal /$file ." >> $log_file
$HADOOP_HOME/bin/hadoop fs -copyToLocal /$file .

echo "Directory content:" >> $log_file
ls -la >> $log_file
echo "---" >> $log_file
echo "File ($local_file) content:" >> $log_file
cat $local_file >> $log_file
echo "---" >> $log_file

# Set initial value to variable
sum=0

# Do some operations with file
for WORD in `cat $local_file`
do
  echo "Adding $WORD" >> $log_file
  sum=$(($sum + $WORD))
done

echo "Creating ($output_file) with result ($sum) from file (/$file)" >> $log_file
echo $sum > $output_file
echo "This is the sum of $local_file content" >> $output_file

# Put files generated into HDFS
echo "Put output and log files to HDFS" >> $log_file
$HADOOP_HOME/bin/hadoop fs -copyFromLocal ./$output_file /output/$output_file
$HADOOP_HOME/bin/hadoop fs -copyFromLocal ./$log_file /output/$log_file

# Remove local copy and temporary files (output and log files)
rm $local_file.*

# Write to standard output
echo $sum

# Script report status
exit 0

This program takes a text file as input (standard input), adds all the numbers it contains and writes the result to an output file with the same name as the input file plus the .out extension. Number delimiters can be white spaces or line breaks. A log file is also created.

In a moment, we will understand how Hadoop sends the name of the file to be processed to our map program through standard input. For the moment, trust this line:

# NLineInputFormat gives a single line: key is offset, value is file to process
read offset file

Second, we need some data to process. Let's suppose that we have a directory called input_data that contains all the files we want to process in a MapReduce manner. For instance:

$ ls -la input_data
-rw-r--r-- 1 root root 10 Jul 14 02:44 data_1
-rw-r--r-- 1 root root 24 Jul 14 02:45 data_2
-rw-r--r-- 1 root root 15 Jul 14 03:15 data_3
-rw-r--r-- 1 root root 40 Jul 14 06:21 data_4
-rw-r--r-- 1 root root 29 Jul 14 06:21 data_5
-rw-r--r-- 1 root root 20 Jul 14 06:31 data_6

Finally, Hadoop Streaming will create an output directory with the results of the map phase; thus, if it already exists in HDFS, this directory must be removed first. To run Hadoop Streaming, I'll create a bash script called run_hdfs.sh like this:
#!/usr/bin/env bash
# run_hdfs.sh: run Hadoop Streaming with the mymap_hdfs.sh script using HDFS
# Written by Juan Lago

hadoop fs -test -e /output
if [ $? -eq 0 ] ; then
  echo "Removing output directory from HDFS..."
  hadoop fs -rmr /output
fi

hadoop fs -test -e /input_data
if [ $? -eq 1 ] ; then
  echo "Uploading data to HDFS..."
  find ./input_data -type f > file_list.txt
  hadoop fs -copyFromLocal ./input_data /input_data
  hadoop fs -copyFromLocal ./file_list.txt /
fi

echo "Checking data..."
hadoop fs -ls /input_data
hadoop fs -cat /file_list.txt

echo "Running Hadoop Streaming..."
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.task.timeout=60000 \
  -input /file_list.txt \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output /output \
  -mapper mymap_hdfs.sh \
  -file mymap_hdfs.sh

The file listing the inputs to be processed is indicated by the -input parameter. We need to generate this file from the input_data directory. This can be done with:

find ./input_data -type f > file_list.txt

After this command, file_list.txt has the following content:

./input_data/data_2
./input_data/data_1
./input_data/data_5
./input_data/data_4
./input_data/data_3
./input_data/data_6
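The file-list generation step can be exercised on its own, outside Hadoop. The sketch below uses a throwaway directory under /tmp (the paths and sample contents are illustrative, not the guide's real data):

```shell
# Build a scratch input_data directory and generate its file list,
# exactly as run_hdfs.sh does with find:
mkdir -p /tmp/find_demo/input_data
printf '1 2 3\n' > /tmp/find_demo/input_data/data_1
printf '4 5\n'   > /tmp/find_demo/input_data/data_2
cd /tmp/find_demo
find ./input_data -type f > file_list.txt

# Each line of file_list.txt is one relative path; these lines are what
# NLineInputFormat will later hand to the mappers one at a time.
cat file_list.txt
```

Note that find emits files in no guaranteed order, which is harmless here: each mapper processes whichever path it receives.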
Hadoop can process many different types of data formats, from flat text files to databases. Files are split into chunks of data, each of which is processed by a single map. Each split is divided into records, and the map processes each record (a key-value pair) in turn. An InputFormat is responsible for creating the input splits and dividing them into records. The -inputformat parameter indicates which class to use to process the input data. In this case, I have used org.apache.hadoop.mapred.lib.NLineInputFormat. N refers to the number of lines of input that each mapper receives. With N set to one (the default), each mapper receives exactly one line of input as the value; the key is the byte offset of that line within the file. The mapred.line.input.format.linespermap property controls the value of N.

In the above example, the key-value pairs emitted would be:

(0, ./input_data/data_2)
(20, ./input_data/data_1)
(40, ./input_data/data_5)
(60, ./input_data/data_4)
(80, ./input_data/data_3)
(100, ./input_data/data_6)

This matches the standard input read in the mymap_hdfs.sh script:

read offset file

Each map program instance will receive a different line (file partition or chunk). So the first map process will receive 0 as offset and ./input_data/data_2 as value, the second map process will receive 20 as offset and ./input_data/data_1 as value, and so on.

As we are not interested in the reduce phase, the mapred.reduce.tasks parameter must be set to 0.

Speculative execution is the way Hadoop fights slow-running tasks: Hadoop tries to detect when a task is running slower than expected and launches another, equivalent task as a backup. Speculative execution is turned on by default, so for this example I'll turn it off (mapred.map.tasks.speculative.execution=false).

The task timeout (mapred.task.timeout) was set to 60 seconds (60,000 milliseconds) so that Hadoop doesn't kill tasks that take a long time.
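The way the mapper consumes that key-value pair can be reproduced without Hadoop at all: Streaming writes the key and the value to the mapper's standard input separated by a tab, and bash's read splits them on whitespace. A minimal stand-in, simulating one such input line by hand:

```shell
# Simulate what Hadoop Streaming feeds the mapper on stdin:
# "<byte offset><TAB><one line from file_list.txt>"
printf '20\t./input_data/data_1\n' | {
  read offset file
  # The mapper now knows which file to fetch from HDFS
  echo "offset=$offset file=$file"
}
# prints: offset=20 file=./input_data/data_1
```

This is why the first line of mymap_hdfs.sh is a plain `read offset file`: the whole Hadoop plumbing collapses to one tab-separated line per mapper.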
Testing this map program is done by running the run_hdfs.sh script:

$ ./run_hdfs.sh

When the job is done, the output files can be accessed through the HDFS command line:

$ hadoop fs -ls /output
Found 19 items
drwxr-xr-x - hadoop supergroup 0 2011-07-23 12:41 /output/_logs
-rw-r--r-- 1 hadoop supergroup 1356 2011-07-23 12:42 /output/data_1.log
-rw-r--r-- 1 hadoop supergroup 37 2011-07-23 12:42 /output/data_1.out
-rw-r--r-- 1 hadoop supergroup 1405 2011-07-23 12:42 /output/data_2.log
-rw-r--r-- 1 hadoop supergroup 38 2011-07-23 12:41 /output/data_2.out
-rw-r--r-- 1 hadoop supergroup 1380 2011-07-23 12:42 /output/data_3.log
-rw-r--r-- 1 hadoop supergroup 37 2011-07-23 12:42 /output/data_3.out
-rw-r--r-- 1 hadoop supergroup 1441 2011-07-23 12:42 /output/data_4.log
-rw-r--r-- 1 hadoop supergroup 41 2011-07-23 12:42 /output/data_4.out
-rw-r--r-- 1 hadoop supergroup 1410 2011-07-23 12:42 /output/data_5.log
-rw-r--r-- 1 hadoop supergroup 39 2011-07-23 12:42 /output/data_5.out
-rw-r--r-- 1 hadoop supergroup 1412 2011-07-23 12:42 /output/data_6.log
-rw-r--r-- 1 hadoop supergroup 37 2011-07-23 12:42 /output/data_6.out
-rw-r--r-- 1 hadoop supergroup 5 2011-07-23 12:41 /output/part-00000
-rw-r--r-- 1 hadoop supergroup 4 2011-07-23 12:41 /output/part-00001
-rw-r--r-- 1 hadoop supergroup 6 2011-07-23 12:42 /output/part-00002
-rw-r--r-- 1 hadoop supergroup 8 2011-07-23 12:42 /output/part-00003
-rw-r--r-- 1 hadoop supergroup 4 2011-07-23 12:42 /output/part-00004
-rw-r--r-- 1 hadoop supergroup 4 2011-07-23 12:42 /output/part-00005

Files ending in .out and .log are generated by the map program (see mymap_hdfs.sh), and files beginning with part- are the results of the Hadoop Streaming execution (the sum of each file, in this case) and come from our map program's standard output. In a cluster configuration, this execution would run in parallel (each node of the cluster would execute several map tasks).
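Before trusting the part- files, the mapper's core arithmetic can be checked locally. The loop below is the same summing logic used in mymap_hdfs.sh, run against a throwaway file (no HDFS involved; the file name is illustrative):

```shell
# Reproduce the mapper's summing loop on a local file. Numbers may be
# separated by white space or line breaks, as in the guide's data files.
printf '3 7\n10\n' > /tmp/data_check

sum=0
for WORD in `cat /tmp/data_check`
do
  sum=$(($sum + $WORD))
done
echo $sum   # prints 20 (= 3 + 7 + 10)
```

If the local result matches the corresponding part- file (or .out file) in /output, the streaming run handled that input correctly.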
A more realistic example

Now we are going to try another example. In this case, we have a program written in C that looks for similarities between two nucleotide (or peptide) sequences. The program is organized in several executables:

mapaf: takes two sequences to compare (identity score matrix with no gaps) and splits them into N and M parts or fragments, respectively. Then it creates a script for running all the comparisons defined by the N x M fragments. These are its parameters:

<app_compare>   Name of the application that will find identities between two fragments (one of sequence 1 and one of sequence 2).
<seq1>          First sequence to compare.
<seq2>          Second sequence to compare.
<splits_1>      Number of parts/fragments sequence 1 is divided into.
<splits_2>      Number of parts/fragments sequence 2 is divided into.
<prefix_out>    The app_compare application will create output files with this prefix. For instance, if this parameter is bin, output files will be of the form bin-0-0 (the result of comparing fragment 0 of sequence 1 with fragment 0 of sequence 2), bin-0-1, ...
<prefix_map>    mapaf will create two bash scripts with this prefix (one for map and one for reduce) for running the comparison of all sequence parts.
<score>         Minimum length of the identity subsequences we are interested in during comparison.
<app_reduce>    Name of the reduce application for running the reduce phase.

AllFragv3: the application for comparing sequence fragments. It is used as the first parameter of mapaf (<app_compare>). Its parameters are:

<fragment_seq1>  First fragment to compare (sequence 1).
<fragment_seq2>  Second fragment to compare (sequence 2).
<output_file>    Name of the output file. For example, bin-0-0, bin-1-0, ...
<score>          Minimum length of the identity subsequences we are interested in during comparison.

An example of how to use the mapaf application:
$ ./mapaf ./AllFragv3 seq1.fa seq2.fa 2 3 bin script 100 ./RedAFv3

In this case, mapaf will create 2 splits for the first sequence and 3 for the second one:

$ ls -la seq*
-rw-r--r-- 1 hadoop hadoop 41 Jul 20 19:03 seq1.fa
-rw-r--r-- 1 hadoop hadoop 31 Jul 23 13:17 seq1.fa-0
-rw-r--r-- 1 hadoop hadoop 32 Jul 23 13:17 seq1.fa-1
-rw-r--r-- 1 hadoop hadoop 45 Jul 20 19:03 seq2.fa
-rw-r--r-- 1 hadoop hadoop 28 Jul 23 13:17 seq2.fa-0
-rw-r--r-- 1 hadoop hadoop 29 Jul 23 13:17 seq2.fa-1
-rw-r--r-- 1 hadoop hadoop 29 Jul 23 13:17 seq2.fa-2

The generated map and reduce scripts are:

$ cat script-map
./AllFragv3 seq1.fa-0 seq2.fa-0 bin-0-0 2
./AllFragv3 seq1.fa-0 seq2.fa-1 bin-0-1 2
./AllFragv3 seq1.fa-0 seq2.fa-2 bin-0-2 2
./AllFragv3 seq1.fa-1 seq2.fa-0 bin-1-0 2
./AllFragv3 seq1.fa-1 seq2.fa-1 bin-1-1 2
./AllFragv3 seq1.fa-1 seq2.fa-2 bin-1-2 2

$ cat script-red
./RedAFv3 bin 2 3

Now, we are going to see how to parallelize this application within Hadoop.
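As an aside, script-map is nothing more than the N x M cross product of fragment pairs. The sketch below is not mapaf's real code, just a hypothetical bash illustration of the generation step it performs (N, M and score taken from the example above):

```shell
# Emit one AllFragv3 invocation per fragment pair, as mapaf does for
# its map script. Every fragment of seq1 is paired with every fragment
# of seq2, so the jobs are independent and trivially parallelizable.
N=2; M=3; score=2
for i in $(seq 0 $(($N - 1)))
do
  for j in $(seq 0 $(($M - 1)))
  do
    echo "./AllFragv3 seq1.fa-$i seq2.fa-$j bin-$i-$j $score"
  done
done
```

This independence between the N x M comparisons is precisely what makes the map phase embarrassingly parallel.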
Parallelizing with Hadoop

run_dotplot.sh

#!/usr/bin/env bash
# run_dotplot.sh: run Hadoop Streaming with the mapdotplot.sh program using HDFS
# Written by Juan Lago

hadoop fs -test -e /dotplot_output
if [ $? -eq 0 ] ; then
  echo "Removing output directory from HDFS..."
  hadoop fs -rmr /dotplot_output
fi

hadoop fs -test -e /dotplot_input_data
if [ $? -eq 0 ] ; then
  echo "Removing input directory from HDFS..."
  hadoop fs -rmr /dotplot_input_data
fi

echo "Preparing data for processing..."
./ejemapaf
hadoop fs -mkdir /dotplot_input_data
$HADOOP_HOME/bin/hadoop fs -put script-map /dotplot_input_data/script-map
$HADOOP_HOME/bin/hadoop fs -cat /dotplot_input_data/script-map
$HADOOP_HOME/bin/hadoop fs -put seq1.fa-* /dotplot_input_data
$HADOOP_HOME/bin/hadoop fs -put seq2.fa-* /dotplot_input_data
$HADOOP_HOME/bin/hadoop fs -ls /dotplot_input_data

read -p "Press ENTER to continue."

echo "Running Hadoop Streaming..."
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.task.timeout=120000 \
  -input /dotplot_input_data/script-map \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output /dotplot_output \
  -mapper mapdotplot.sh \
  -file mapdotplot.sh \
  -file AllFragv3

echo "Downloading files from /dotplot_output"
rm -fr ./output
mkdir ./output
hadoop fs -copyToLocal /dotplot_output/* ./output
echo "Hadoop Streaming done!!"
mapdotplot.sh

#!/usr/bin/env bash
# mapdotplot.sh: running a C program using Hadoop Streaming and HDFS.
# Written by Juan Lago

# NLineInputFormat gives a single line: key is the file offset, value is
# one line of script-map. Lines are expected like the examples below:
# ./AllFragv3 seq1.fa-0 seq2.fa-0 bin-0-0 2
# ./AllFragv3 seq1.fa-0 seq2.fa-1 bin-0-1 2
# ./AllFragv3 seq1.fa-0 seq2.fa-2 bin-0-2 2
# ...
read offset CMD

# Split the line into individual words (white space delimiter)
set $CMD
app=$1
seq1=$2
seq2=$3
output_file=$4
score=$5

echo "Downloading sequences from HDFS: ($seq1) and ($seq2) for generating ($output_file)" >&2
$HADOOP_HOME/bin/hadoop fs -get /dotplot_input_data/$seq1 .
$HADOOP_HOME/bin/hadoop fs -get /dotplot_input_data/$seq2 .

# Assign execution rights to AllFragv3
chmod +x $app

$CMD

ls -la >&2

# Upload output file to HDFS
echo "Uploading output files to HDFS" >&2
$HADOOP_HOME/bin/hadoop fs -put ./$output_file* /dotplot_output

# Write something to standard output
echo "Files $seq1 and $seq2 processed. Output file is $output_file!!"

# Script report status
exit 0
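The `read offset CMD` plus `set $CMD` idiom at the heart of mapdotplot.sh can also be tested in isolation. The snippet below feeds it one script-map line by hand, exactly as Hadoop Streaming would (offset, a tab, then the line):

```shell
# Parse one script-map line the same way mapdotplot.sh does: read takes
# the first word as the key (offset) and the rest of the line as CMD,
# then set word-splits CMD into positional parameters $1..$5.
printf '0\t./AllFragv3 seq1.fa-0 seq2.fa-1 bin-0-1 2\n' | {
  read offset CMD
  set $CMD
  echo "app=$1 seq1=$2 seq2=$3 out=$4 score=$5"
}
# prints: app=./AllFragv3 seq1=seq1.fa-0 seq2=seq2.fa-1 out=bin-0-1 score=2
```

A small caution: plain `set $CMD` misbehaves if the first word of the line starts with a dash; `set -- $CMD` is the defensive form, though it is unnecessary here because every script-map line starts with ./AllFragv3.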