Lab 9: Hadoop Development

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.

Introduction

Hadoop can be run in one of three modes:

Standalone (or local) mode
No daemons run and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development since it is easy to test and debug them. It is not necessarily adequate for running other applications that we will learn about, like Hive, Pig, etc.

Pseudo-distributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale. Running in this mode is recommended before running on a real cluster.

Fully distributed mode
The Hadoop daemons run on a cluster of machines. More details are required to run in fully distributed mode than are provided in this lab.

Platforms: Hadoop is written in Java, so you will need Java version 6 or later. Hadoop runs on Unix and Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development.

Note: Windows is only supported as a development environment and also requires the dreaded Cygwin to run. During the Cygwin installation process, you need to include the openssh package if you plan to run Hadoop in pseudo-distributed mode. Getting the permission issues correct is difficult; you should be OK with standalone mode if you want to go this route.

In this lab, I will use Ubuntu. Good news: I did get everything running on Ubuntu!
Documentation and credits:
- Hadoop: The Definitive Guide, Tom White
- Apache Hadoop Quick Start: http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
- Apache Hadoop Cluster Setup: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
- Cloudera: https://ccp.cloudera.com/display/cdhdoc/cdh3+installation#cdh3installation-InstallingCDH3onUbuntuSystems
- Running Hadoop on Windows with Eclipse (again, I had trouble getting this running due to Cygwin OpenSSH): http://ebiquity.umbc.edu/tutorials/hadoop/00%20-%20intro.html
- Sample instructions for installing Cygwin OpenSSH on Windows: http://chinese-watercolor.com/lrp/printsrv/cygwin-sshd.html
Setting up Hadoop on Ubuntu using the Cloudera distribution:

1) PuTTY into an Ubuntu instance.

2) Add a software repository by creating a new file:

$ sudo touch /etc/apt/sources.list.d/cloudera.list
$ sudo vi /etc/apt/sources.list.d/cloudera.list

Add the following contents to the file:

deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib

where <RELEASE> is the codename of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 for Ubuntu lucid, use lucid-cdh3 in the lines above:

deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib

Verify that you have updated this file properly:

$ sudo cat /etc/apt/sources.list.d/cloudera.list

3) (Optional) You can add a repository key that enables you to verify that you are downloading genuine packages, so you are not prompted later. Add the Cloudera public GPG key to your repository by executing the following command:

$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

4) Update the APT package index:

$ sudo apt-get update

5) Find and install the Hadoop core package. (Note: this is different from Cloudera's instructions. I could not get their instructions to work!)

$ sudo apt-cache search hadoop
$ echo "deb http://archive.canonical.com/ lucid partner" | sudo tee /etc/apt/sources.list.d/partner.list
$ sudo apt-get update
$ sudo apt-get install hadoop-0.20

6) Make sure your JAVA_HOME environment variable is defined:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
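The repository lines in step 2 can also be generated from lsb_release instead of typed by hand, which avoids typos in the codename. A minimal sketch (the fallback codename "lucid" and the redirect into cloudera.list are assumptions for this illustration, not part of Cloudera's instructions):

```shell
#!/bin/sh
# Build the Cloudera CDH3 apt source lines from the distribution codename.
# Falls back to "lucid" if lsb_release is unavailable (assumption for this sketch).
RELEASE=$(lsb_release -cs 2>/dev/null || echo lucid)
printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$RELEASE"
printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$RELEASE"
# To install the lines, pipe the script's output through tee, e.g.:
#   ./gen-cloudera-list.sh | sudo tee /etc/apt/sources.list.d/cloudera.list
```

lsb_release -cs prints just the codename (e.g. "lucid"), which is why it is preferred here over parsing the full -c output.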
7) Verify Hadoop is installed:

$ hadoop version

8) Install ssh (required for pseudo-distributed and fully distributed modes):

$ sudo apt-get install ssh
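Steps 6 and 7 can be folded into one quick sanity check before moving on. A sketch that only prints diagnostics (it installs nothing, and degrades gracefully on a machine without Hadoop on the PATH):

```shell
#!/bin/sh
# Sanity-check the Hadoop development environment.
if [ -n "$JAVA_HOME" ]; then
  echo "JAVA_HOME=$JAVA_HOME"
else
  echo "JAVA_HOME is not set"
fi
# hadoop version prints the installed release on its first line.
if command -v hadoop >/dev/null 2>&1; then
  hadoop version | head -n 1
else
  echo "hadoop not found on PATH"
fi
```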
9) To run Hadoop in a particular mode, you need to do two things: set the appropriate properties, and start the Hadoop daemons.

Standalone Operation

By default, Hadoop is configured to run in a non-distributed mode as a single Java process. As explained earlier, this is useful for debugging. You do not have to create an HDFS file system for standalone mode.

The following copies an unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

$ mkdir input
$ cp /usr/lib/hadoop-0.20/conf/*.xml input

Verify:

$ ls input

Run one of the Hadoop examples. In the example below, I am running the MapReduce version of grep, which counts the matches of a regex in the input.

$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar grep input output 'dfs[a-z.]+'
11/05/05 19:44:39 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/05/05 19:44:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/05/05 19:44:39 INFO mapred.FileInputFormat: Total input paths to process : 7
11/05/05 19:44:40 INFO mapred.JobClient: Running job: job_local_0001

You can list the other Hadoop examples by leaving off the arguments as follows:

$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar

Run and document at least one other Hadoop example.
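To see what the grep example is computing, the same match-and-count can be reproduced with ordinary shell tools. A sketch, with a small sample input created inline so it runs anywhere (the sample file and temp directory are stand-ins for the conf/*.xml inputs used above):

```shell
#!/bin/sh
# Emulate the standalone MapReduce grep example with plain shell tools:
# extract every match of 'dfs[a-z.]+' from the *.xml inputs, then count
# occurrences per distinct match, most frequent first.
dir=$(mktemp -d)
cat > "$dir/sample.xml" <<'EOF'
<name>dfs.replication</name>
<name>dfs.name.dir</name>
<name>dfs.replication</name>
EOF
grep -ohE 'dfs[a-z.]+' "$dir"/*.xml | sort | uniq -c | sort -rn
rm -r "$dir"
```

For this sample, dfs.replication is reported twice and dfs.name.dir once, mirroring the count-per-match output the MapReduce job writes to its output directory.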
Extra Credit: Pseudo-Distributed Operation

Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

Configuration

Use the following /etc/hadoop-0.20/conf/core-site.xml. (This needs to be in the conf directory of the version of Hadoop you are running. If not, use the --config option on the hadoop command line and specify an alternative location.)

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Setup passphraseless ssh

Check that you can ssh to the localhost without a passphrase (type exit to exit):

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands as your own user (not with sudo, so the keys land in your home directory with the right ownership):

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
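The configuration file above can also be written from the shell with a here-document, which is handy when scripting the setup on a fresh instance. A sketch; CONF_DIR is a stand-in that defaults to a scratch path so it does not clobber a real install, but you would point it at /etc/hadoop-0.20/conf:

```shell
#!/bin/sh
# Write the pseudo-distributed core-site.xml shown above.
# CONF_DIR is a hypothetical variable for this sketch; on a real setup,
# run with CONF_DIR=/etc/hadoop-0.20/conf (and sudo).
CONF_DIR=${CONF_DIR:-/tmp/hadoop-conf}
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
echo "wrote $CONF_DIR/core-site.xml"
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the XML, so the file is written verbatim.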
Execution

Change file permissions:

$ sudo chmod -R 777 /usr/lib/hadoop-0.20/

Change file ownership:

$ sudo chown -R hdfs /usr/lib/hadoop-0.20
$ sudo chown -R hdfs /var/log/hadoop-0.20

Format a new distributed filesystem:

$ hadoop namenode -format

Set JAVA_HOME and the various USER variables for Hadoop. Add the following lines to /usr/lib/hadoop-0.20/conf/hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_NAMENODE_USER=hdfs
export HADOOP_DATANODE_USER=hdfs
export HADOOP_SECONDARYNAMENODE_USER=hdfs
export HADOOP_JOBTRACKER_USER=hdfs
export HADOOP_TASKTRACKER_USER=hdfs

To use the hdfs user, a password needs to be set. Use the following command to set the password for the hdfs user:

$ sudo passwd hdfs

Start the Hadoop daemons:

$ sudo /usr/lib/hadoop-0.20/bin/start-all.sh

The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Hadoop commands need to be run as the hdfs user. To switch to the hdfs user, use the following command:

$ su hdfs

To access the web interfaces, open ports 50030, 50070, and 50075 on the AWS console. After all of this, I finally got everything to work correctly.

Browse the web interfaces for the NameNode and the JobTracker. By default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Copy the input files into the distributed filesystem:

$ bin/hadoop dfs -put conf input
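Before browsing the web interfaces, you can check from the shell whether the daemons are actually listening on the default ports. A sketch using bash's /dev/tcp redirection (bash-specific, so it will not work under plain sh; check_port is a helper name made up for this example):

```shell
#!/bin/bash
# Probe the default NameNode (50070) and JobTracker (50030) web UI ports.
check_port() {
  local host=$1 port=$2
  # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>;
  # the subshell scopes fd 3 so it is closed automatically on exit.
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}
check_port localhost 50070   # NameNode web UI
check_port localhost 50030   # JobTracker web UI
```

If a port reports closed, check the daemon logs under ${HADOOP_LOG_DIR} before suspecting the AWS console firewall rules.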
Run some of the examples provided:

$ bin/hadoop jar hadoop-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files. Either copy them from the distributed filesystem to the local filesystem and examine them there:

$ bin/hadoop dfs -get output output
$ cat output/*

or view them directly on the distributed filesystem:

$ bin/hadoop dfs -cat output/*

When you're done, stop the daemons with:

$ bin/stop-all.sh

Submission

Submit a short lab report on Blackboard demonstrating your results and providing feedback on the lab.

Thanks,
Jay Urbain