HADOOP CLUSTER SETUP GUIDE

Passwordless SSH Sessions:
Before we start the installation, we have to ensure that passwordless SSH login is possible to any of the Linux machines of CS120. To do so, log in to your account and run the following commands:
[For further background, see: https://hadoopabcd.wordpress.com/2015/05/16/linuxpassword-less-ssh-logins/ ]

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Setting up Config Files:
Step 1. Create a new folder for your configuration files:

mkdir ~/hadoopconf

Step 2. Create core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Sample versions of these four files can be downloaded and extracted with:

wget http://www.cs.colostate.edu/~cs435/pa1/conf.tar
tar xvf conf.tar

Move the extracted configuration files into your configuration directory:

mv * ~/hadoopconf

In core-site.xml, mapred-site.xml, and hdfs-site.xml, anything marked HOST or PORT should be replaced with actual hostnames and port numbers of your choice. The list of hostnames in CSB120 is available at:
http://www.cs.colostate.edu/~info/machines

Make sure you select at least 10 nodes (CS120 machines only) for your slaves file, which is detailed below in Step 4. You also need to select a set of available ports from the non-privileged port range. Do not select any ports between 56,000 and 57,000; these are dedicated to the shared HDFS cluster. To reduce possible port conflicts, you will each be assigned a port range:
http://www.cs.colostate.edu/~cs435/pa1/pa1-portassignment.pdf
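For example, if you picked the machine anchorage from the machine list and port 30101 from your assigned range (both values are purely illustrative), the namenode entry in core-site.xml would end up looking like this:

<property>
  <name>fs.default.name</name>
  <value>hdfs://anchorage:30101</value>
</property>

The remaining HOST/PORT placeholders in hdfs-site.xml and mapred-site.xml are filled in the same way, each with a distinct port taken from your assigned range.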
Step 3. Set the environment variables in your .bashrc file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=<path to your configuration directory from Step 2>
export YARN_CONF_DIR=${HADOOP_CONF_DIR}
export HADOOP_LOG_DIR=/tmp/${USER}/hadoop-logs
export YARN_LOG_DIR=/tmp/${USER}/yarn-logs
export JAVA_HOME=/usr/local/jdk1.8.0_51
export HADOOP_OPTS="-Dhadoop.tmp.dir=/s/${HOSTNAME}/a/tmp/hadoop-${USER}"

Step 4. Create the masters and slaves files
The masters file should contain the name of the machine you choose as your secondary namenode. The slaves file contains the list of all your worker nodes. Programming Assignment 1 requires at least 10 such nodes.

Running HDFS
a. First Time Configuration
Step 1. Log into your namenode. You will have specified your namenode in core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://host:port</value>
</property>

Step 2. Format your namenode:

$HADOOP_HOME/bin/hdfs namenode -format

b. Starting HDFS
Step 1. Run the start script:

$HADOOP_HOME/sbin/start-dfs.sh

Step 2. Check the web portal to make sure that HDFS is running: open http://<namenode>:<port> in your browser, where the port is the one given for the property dfs.namenode.http-address in hdfs-site.xml.
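A quick sanity check after the start script finishes (a suggested check, not part of the original steps; jps ships with the JDK under $JAVA_HOME/bin):

jps                                      # on the namenode this should list NameNode; on workers, DataNode
$HADOOP_HOME/bin/hdfs dfsadmin -report   # summarizes the live datanodes and their capacity
$HADOOP_HOME/bin/hdfs dfs -ls /          # confirms the filesystem answers basic requests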
c. Stopping HDFS
Step 1. Log into your namenode
Step 2. Run the stop script:

$HADOOP_HOME/sbin/stop-dfs.sh

Running Yarn
a. Starting Yarn
Step 1. Log into your resource manager (the Resource Manager is the node you set as HOST in yarn-site.xml)
Step 2. Run the start script:

$HADOOP_HOME/sbin/start-yarn.sh

Step 3. Check the web portal to make sure Yarn is running:
http://<resourcemanager>:<port given for the property yarn.resourcemanager.webapp.address>

b. Stopping Yarn
Step 1. Log into your resource manager
Step 2. Run the stop script:

$HADOOP_HOME/sbin/stop-yarn.sh

Running a Job Locally
a. Setup
Step 1. Open mapred-site.xml
Step 2. Change the value of the property mapreduce.framework.name to local.
To be safe, restart HDFS and ensure Yarn is stopped.
b. Running the Job
From any node:

$HADOOP_HOME/bin/hadoop jar your.jar yourclass args
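For reference, the stanza you edit in Step 2 above should end up looking like this in mapred-site.xml (the property name comes from the steps above; only the value changes between the two modes, so the same stanza with the value yarn is used in the next section):

<property>
  <name>mapreduce.framework.name</name>
  <value>local</value>
</property>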
Running a Job in Yarn
a. Setup
Step 1. Open mapred-site.xml
Step 2. Change the value of the property mapreduce.framework.name to yarn.
To be safe, restart HDFS and Yarn.
b. Running the Job
From any node:

$HADOOP_HOME/bin/hadoop jar your.jar yourclass args

Accessing Shared Data Sets
We are using ViewFS-based federation for sharing data sets due to space constraints. Your MapReduce programs will deal with two name nodes in this setup:
1. The name node of your own HDFS cluster. This name node is used for writing the output generated by your program. We will refer to this as the local HDFS.
2. The name node of the shared HDFS cluster. This name node hosts the input data for your programs and is read-only. You should not attempt to write to this cluster; it is only for staging input data.

Data sets are hosted in the /data directory of the shared file system. You will get r+x permission for this directory, which allows your MapReduce programs to traverse the directory and read its files.
ViewFS-based federation is conceptually similar to mounts in Linux file systems. Different directories visible to the client are mounted to different locations in the namespaces of the name nodes. This provides a seamless view to the client, as if it were accessing a single file system. There are three mounts created in the core-site.xml at the client's end:

Mount Point    Name Node    Directory in Name Node
/home          Local        /
/tmp           Local        /tmp
/data          Shared       /data
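Under the hood, such mounts are expressed as ViewFS mount-table entries in the client's core-site.xml; a rough sketch is shown below (the mount-table name, hosts, and ports are illustrative guesses about this setup; the real entries already ship with the client configuration described later and should not be recreated by hand):

<property>
  <name>fs.viewfs.mounttable.default.link./home</name>
  <value>hdfs://your-namenode:PORT/</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./data</name>
  <value>hdfs://shared-namenode:PORT/data</value>
</property>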
[Figure: graphical representation of the mount points listed above.]

For example, if you want to read the Gutenberg project data, the input file path should be given as /data/gutenberg, which points to /data/gutenberg in the shared HDFS. If you write your output to /home/output-1, it will be written to /output-1 of the local HDFS.
You can also use the /nobackup directory. The /nobackup directory is not deleted every week; however, if the cluster goes down, you will not be able to recover your files in this directory.

a. Running a MapReduce Client
Usually a client uses the same Hadoop configuration used for creating the cluster when running MapReduce programs. In this setup, the Hadoop configuration for a client is different, mainly because of the mount points discussed above. In addition, your MapReduce program will run on the MapReduce runtime based on the shared HDFS. In other words, your local HDFS cluster is used only for storing the output; you will read input and run your MapReduce programs on the shared Hadoop cluster. This ensures data locality for the input data.
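Once the client configuration described in the next section is in place (with HADOOP_CONF_DIR pointing at it), you can sanity-check these mounts with ordinary HDFS commands; for example, using the paths above:

$HADOOP_HOME/bin/hdfs dfs -ls /data/gutenberg   # resolved against the shared, read-only name node
$HADOOP_HOME/bin/hdfs dfs -ls /home             # resolved against the root directory of your local HDFS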
Follow these steps when running a MapReduce program.
1. Download the client configuration tarball from http://www.cs.colostate.edu/~cs435/pa1/sharedfileclient-config.tar
2. Extract the tarball.
3. This configuration directory contains three files. You only need to update the core-site.xml file; the other files should not be changed. Update the NAMENODE and PORT entries in core-site.xml with the hostname and port of the name node of your local cluster. You can find this information in the core-site.xml of your local HDFS cluster configuration. There is also an entry with honolulu as the name node hostname in the core-site.xml file. Do not change this entry; it points to the name node of the shared cluster.
4. Now you need to update the environment variable HADOOP_CONF_DIR. This has already been set in the rc file of your shell (.bashrc for Bash, or the corresponding rc file for Tcsh). Do not update the environment variable set in that rc file, because your local Hadoop cluster runs based on that parameter. Instead, just overwrite the environment variable for the current shell instance:

export HADOOP_CONF_DIR=/path/to/extracted-client-config

This overwritten value is valid only for that particular instance of the shell. You need to overwrite this environment variable every time you spawn a new shell to run a MapReduce program.
5. Now you can run your Hadoop program as usual. Make sure to specify paths using the mount points. For example, suppose you need to run your data analysis program and the output should be written to the /gutenberg/output directory of your local cluster. Let's assume the jar is in your home directory and the job takes only two parameters: the input path and the output path, respectively. The command to run such a program is:

$HADOOP_HOME/bin/hadoop jar ~/analysis.jar com.user.mainjob /data/gutenberg /home/gutenberg/output
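Putting the pieces together, a complete session for the example above might look like the following sketch (the extraction directory ~/client-config and the ~/hadoopconf cluster configuration directory from Step 2 of the setup are assumed; adjust the paths to match your own layout):

export HADOOP_CONF_DIR=~/client-config                 # client configuration for this shell only
$HADOOP_HOME/bin/hadoop jar ~/analysis.jar com.user.mainjob /data/gutenberg /home/gutenberg/output
$HADOOP_HOME/bin/hdfs dfs -ls /home/gutenberg/output   # view the results through the /home mount
export HADOOP_CONF_DIR=~/hadoopconf                    # switch back to your cluster configuration
$HADOOP_HOME/bin/hdfs dfs -ls /gutenberg/output        # the same results, addressed directly on the local HDFS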