Setup Hadoop on Ubuntu Linux --- Multi-Node Cluster

We have installed the JDK and Hadoop for you.
The JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22
The Hadoop home is /home/user/hadoop-0.20.2

1. Network

Edit /etc/hosts on every node.

Suppose you have, for example, the following nodes (remember to replace the
hostnames according to your machines, e.g. ubuntu01-01, ubuntu01-02):

    master:
        IP: 192.168.0.1    hostname: ubuntu01-01
    slaves:
        IP: 192.168.0.2    hostname: ubuntu01-02
        IP: 192.168.0.3    hostname: ubuntu01-03
        IP: 192.168.0.4    hostname: ubuntu01-04
        IP: 192.168.0.5    hostname: ubuntu01-05

Then add the following lines to /etc/hosts on every node:

    # /etc/hosts (for master AND slaves)
    192.168.0.1 ubuntu01-01
    192.168.0.2 ubuntu01-02
    192.168.0.3 ubuntu01-03
    192.168.0.4 ubuntu01-04
    192.168.0.5 ubuntu01-05

2. Configure

On the master, edit conf/masters as follows:

    ubuntu01-01

and edit conf/slaves as follows:

    ubuntu01-02
    ubuntu01-03
    ubuntu01-04
    ubuntu01-05
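The /etc/hosts entries above follow a simple numeric pattern, so a short shell loop can generate them instead of typing each one. This is just a convenience sketch using the example IPs and hostnames from this guide; substitute your own addressing scheme.

```shell
#!/bin/sh
# Emit the cluster host entries shown above; append the output to
# /etc/hosts on every node (modifying /etc/hosts requires root).
for i in 1 2 3 4 5; do
    printf '192.168.0.%d ubuntu01-0%d\n' "$i" "$i"
done
```

Typical usage would be something like `sh gen_hosts.sh | sudo tee -a /etc/hosts`.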
On every node do the following:

1). Configure JAVA_HOME

    $ cd hadoop-0.20.2
    $ gedit conf/hadoop-env.sh

Change:

    # The java implementation to use. Required.
    # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to:

    # The java implementation to use. Required.
    export JAVA_HOME=/usr/lib/jvm/java/jdk1.6.0_22

Save & exit.

2). Create some directories in the Hadoop home:

    $ cd hadoop-0.20.2
    $ mkdir tmp
    $ mkdir hdfs
    $ mkdir hdfs/name
    $ mkdir hdfs/data

3). Configuration setup

Under conf/, edit the following files. Note that paths such as
"/home/${user.name}/hadoop" should be adjusted to match your actual Hadoop
home (here, /home/user/hadoop-0.20.2). Each setting must be wrapped in a
<property> element.

conf/core-site.xml:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://ubuntu01-01:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
      </property>
    </configuration>
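The directory creation in step 2) can also be done in one idempotent command with `mkdir -p`, which creates missing parents and is safe to re-run. A minimal sketch, assuming the Hadoop home from this guide (override `HADOOP_HOME` if yours differs):

```shell
#!/bin/sh
# Create the working directories Hadoop expects under the Hadoop home.
# HADOOP_HOME defaults to the location used in this guide.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-0.20.2}"
mkdir -p "$HADOOP_HOME/tmp" "$HADOOP_HOME/hdfs/name" "$HADOOP_HOME/hdfs/data"
```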
conf/hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/${user.name}/hadoop/hdfs/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/${user.name}/hadoop/hdfs/data</value>
      </property>
      <property>
        <name>fs.checkpoint.dir</name>
        <value>/home/${user.name}/hdfs/namesecondary</value>
      </property>
    </configuration>

conf/mapred-site.xml:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>ubuntu01-01:9001</value>
      </property>
    </configuration>

4). Configure passphraseless ssh

    $ ssh localhost

You will be asked for a password to log in over ssh. Set up a passphraseless
key:

    $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ exit

Configuration done. Try:

    $ ssh localhost

You should now log in without a password.
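The key-generation commands in step 4) can be wrapped in an idempotent sketch that only creates a key if one does not already exist and avoids appending duplicate entries to authorized_keys. The file locations are the standard OpenSSH defaults; this is a convenience script, not part of the stock Hadoop setup.

```shell
#!/bin/sh
# Set up a passphraseless RSA key for local ssh login; safe to re-run.
mkdir -p "$HOME/.ssh"
# Generate a key only if none exists yet.
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -q -t rsa -P '' -f "$HOME/.ssh/id_rsa"
# Append the public key only if it is not already authorized.
grep -qF "$(cat "$HOME/.ssh/id_rsa.pub")" "$HOME/.ssh/authorized_keys" 2>/dev/null \
    || cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```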
3. SSH Access

The master must have passphraseless login access to all slaves:

    user@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@ubuntu01-02
    user@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@ubuntu01-03
    user@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@ubuntu01-04
    user@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@ubuntu01-05

You will need the corresponding slave's password to run the commands above.
Try:

    user@ubuntu01-01:~$ ssh ubuntu01-02
    user@ubuntu01-01:~$ ssh ubuntu01-03
    user@ubuntu01-01:~$ ssh ubuntu01-04
    user@ubuntu01-01:~$ ssh ubuntu01-05

You should now log in without a password.

4. First Run

You must format the HDFS (Hadoop Distributed File System) before first use.
Run the following command on the master:

    $ bin/hadoop namenode -format

5. Start Cluster

1). Start HDFS Daemons

Run the following command on the master:

    $ bin/start-dfs.sh

Use the following command on every node to check the status of the daemons:

    $ jps

Run jps on the master; you should see something like this:

    7803 NameNode
    8354 SecondaryNameNode
Run jps on the slaves; you should see something like this:

    7906 DataNode

2). Start MapReduce Daemons

Run the following command on the master:

    $ bin/start-mapred.sh

Use the following command on every node to check the status of the daemons:

    $ jps

Run jps on the master; you should see something like this:

    7803 NameNode
    8422 JobTracker
    8354 SecondaryNameNode

Run jps on the slaves; you should see something like this:

    7906 DataNode
    8547 TaskTracker

6. Hadoop Web Interfaces

There are several web interfaces that show what is going on in the running
Hadoop cluster (replace localhost with the master's hostname when browsing
from another machine):

    http://localhost:50030/   web UI for the MapReduce job tracker
    http://localhost:50060/   web UI for the task tracker(s)
    http://localhost:50070/   web UI for the HDFS name node

7. Run a MapReduce Job: WordCount

Create a directory named "input" in HDFS:

    $ bin/hadoop dfs -mkdir input
Copy some text files into input:

    $ bin/hadoop dfs -put conf/* input

Run WordCount:

    $ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output

Display the output:

    $ bin/hadoop dfs -cat output/*

8. Stop Cluster

Stop the MapReduce daemons; run on the master:

    $ bin/stop-mapred.sh

Stop the HDFS daemons; run on the master:

    $ bin/stop-dfs.sh
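What WordCount computes can be reproduced locally with standard shell tools, which is handy for sanity-checking the job's results on a small file. This is only an illustration of the job's semantics, not part of the Hadoop workflow (the exact output formatting differs from the job's part files):

```shell
#!/bin/sh
# Count word occurrences the way WordCount does: split the text on
# whitespace into one token per line, sort, then count distinct tokens.
printf 'hello world\nhello hadoop\n' \
    | tr -s ' \t' '\n\n' \
    | sort \
    | uniq -c
```

Running this prints each distinct word with its count (hadoop 1, hello 2, world 1), which you can compare against the corresponding lines of `bin/hadoop dfs -cat output/*`.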