This handout describes how to start Hadoop in fully distributed mode, rather than the pseudo-distributed mode that Hadoop is preconfigured for when you download it.


Lydia Montgomery
8 years ago

AWS Starting Hadoop in Distributed Mode

1) Start up 3 small/medium instances. Use small if you aren't going to use Bigtop. Use medium if you want to use Bigtop and other Apache components on top of Hadoop. For this step I am going to use a medium instance.

Go to http://cloud.ubuntu.com/ami to pick out an Ubuntu Lucid AMI. Use the select boxes to pick out a 64-bit Ubuntu Lucid AMI with EBS storage in the AMI region you are working in. I use US-EAST.

Once you have the AMI, either click on the link to get an instance with an 8GB disk, or use the ec2-run-instances command:

    ec2-run-instances ami-a29943cb -k absolutesw -t m2.2xlarge -z "us-east-1d" -b /dev/sda1=snap-68e9db15:300:true -g default

If you are doing a Bigtop build you will need more than an 8GB disk, so you will have to use the ec2-run-instances command, which is available in the ec2-api package. Download and install that package before running the command. For the -k option, use the name of the keypair .pem file with the .pem extension left off. After the instance is up and running, run resize2fs to grow the root filesystem.
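The keypair naming rule above can be sketched in shell. This is only an illustration: the .pem path is a hypothetical location, the remaining arguments are copied from the handout's command, and the command line is printed rather than executed.

```shell
#!/bin/sh
# keypair_name PATH: strip the directory and the .pem extension, which is
# the form the -k option expects.
keypair_name() {
    basename "$1" .pem
}

# Hypothetical location of the private key file (an assumption, not from the handout).
PEM_FILE="$HOME/.ssh/absolutesw.pem"
KEYPAIR=$(keypair_name "$PEM_FILE")

# Print the command instead of running it, so it can be reviewed first.
echo ec2-run-instances ami-a29943cb -k "$KEYPAIR" -t m2.2xlarge \
    -z us-east-1d -b /dev/sda1=snap-68e9db15:300:true -g default
```

Dropping the leading echo would launch the instance, provided the EC2 API tools are installed and configured.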


Before:

    df -h
    Filesystem  Size  Used  Avail  Use%  Mounted on
    /dev/sda1   7.9G  730M  6.8G   10%   /
    none        1.9G  112K  1.9G   1%    /dev
    none        1.9G  0     1.9G   0%    /dev/shm
    none        1.9G  48K   1.9G   1%    /var/run
    none        1.9G  0     1.9G   0%    /var/lock
    none        1.9G  0     1.9G   0%    /lib/init/rw
    /dev/sdb    394G  199M  374G   1%    /mnt

Grow the partition using:

    ubuntu@ip :~$ sudo resize2fs /dev/sda1

After:

    ubuntu@ip :~$ df
    Filesystem  1K-blocks  Used  Available  Use%  Mounted on
    /dev/sda    %  /
    none        %  /dev
    none        %  /dev/shm
    none        %  /var/run
    none        %  /var/lock
    none        %  /lib/init/rw
    /dev/sdb    %  /mnt

Label 1 of the instances master and the other 2 slaves. The master is configured for the namenode; the slaves run datanodes.

2) Download the hadoop 1.00.x distribution and jdk1.6.x to all 3 AWS instances. Install Java and set JAVA_HOME. Verify that Hadoop works on the master node first by formatting the namenode and checking that HDFS works. We download Hadoop as a separate package first; once distributed mode is configured and verified, we will configure Bigtop to do the same. When formatting the namenode, make sure you verify that the NN is formatted successfully:

    12/07/15 04:22:15 INFO common.storage: Storage directory /tmp/hadoop-ubuntu/dfs/name has been successfully formatted.

Create an ssh key using:

    ssh-keygen -t rsa -P ""
    hadoop@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Edit hadoop-env.sh and hbase-env.sh and update JAVA_HOME to the Sun 1.6.x JDK. Make sure you use the Sun 6.x JDK; OpenJDK has intermittent bugs which are still there.

Where you see localhost, replace this with the IP address of the master node.

core-site.xml. For debugging set fs.default.name to file:///home/ubuntu/testhbase

    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>

mapred-site.xml

    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>

hdfs-site.xml. Set dfs.replication to 3 for distributed mode

    <name>dfs.replication</name>
    <value>3</value>

hbase-site.xml

    <name>hbase.cluster.distributed</name>
    <value>true</value>
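As a rough sketch of the "replace localhost with the master IP" step, the script below generates the three Hadoop site files for one master address. The address and the output directory are assumptions (it writes to a temporary directory unless CONF_DIR is set). Each name/value pair is wrapped in the <property> element that Hadoop configuration files require, with fs.default.name in core-site.xml and dfs.replication in hdfs-site.xml, which is where the Bigtop files in this handout keep them.

```shell
#!/bin/sh
# Hypothetical master address and output directory; override via the environment.
MASTER_IP="${MASTER_IP:-10.0.0.10}"
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# write_site FILE NAME VALUE: write a one-property Hadoop configuration file.
write_site() {
    cat > "$CONF_DIR/$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Same properties and ports as the handout, with localhost replaced by the master IP.
write_site core-site.xml   fs.default.name    "hdfs://$MASTER_IP:8020"
write_site mapred-site.xml mapred.job.tracker "$MASTER_IP:8021"
write_site hdfs-site.xml   dfs.replication    3

echo "Wrote configuration files to $CONF_DIR"
```

Review the generated files before copying them into the Hadoop conf directory on each node, since a real installation usually carries more properties than these three.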


To mapred-site.xml:

    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
    </configuration>

To core-site.xml:

    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
    </configuration>

For comparison, here are the Bigtop configuration files in pseudo-distributed mode.

core-site.xml:

    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop/cache/${user.name}</value>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>

hdfs-site.xml:


    <name>dfs.replication</name>
    <value>1</value>
    <name>dfs.permissions</name>
    <value>false</value>
    <name>dfs.safemode.extension</name>
    <value>0</value>
    <name>dfs.safemode.min.datanodes</name>
    <value>1</value>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/cache/hadoop/dfs/name</value>
    </configuration>

mapred-site.xml:

    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
    </configuration>

Hadoop Distributed Mode

First make sure Hadoop works in distributed mode. You should be able to see the data node logs on the slaves, which should show them accepting data.


To verify you are in distributed mode, go to the Hadoop status page at http://<master ip address>:50030 for the task trackers (if you want MR), or port 50070 for HDFS.
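The two status pages can be written down as a small helper. This is a sketch only: the master address is a placeholder, and the curl probe is left as a comment because it needs network access to a running cluster.

```shell
#!/bin/sh
# status_url MASTER ROLE: print the web UI URL for the given Hadoop 1.x daemon.
status_url() {
    case "$2" in
        mapreduce) echo "http://$1:50030/" ;;   # JobTracker page, lists the task trackers
        hdfs)      echo "http://$1:50070/" ;;   # NameNode page, lists the data nodes
        *)         echo "unknown role: $2" >&2; return 1 ;;
    esac
}

MASTER_IP="${MASTER_IP:-10.0.0.10}"   # hypothetical master address

for role in mapreduce hdfs; do
    url=$(status_url "$MASTER_IP" "$role")
    echo "$role status page: $url"
    # From a machine that can reach the master, one could probe the page with:
    #   curl -s -o /dev/null -w '%{http_code}\n' "$url"
done
```

On EC2 the instance's security group also has to allow these ports, otherwise the pages will not load from outside.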

In this configuration we also started a data node on the primary namenode. The HBase performance tests use MR. Start the job trackers and task trackers:

    sudo service hadoop-jobtracker start
    sudo service hadoop-tasktracker start

Then run:

    hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi

to verify the MR daemons are happy.

Map Reduce Task Trackers:

HDFS DataNodes:

For a summary of HADOOP ports:

For a summary of HBASE ports:

    60030: HBase regionservers
    60010: HBase Master node

Running HBase on top of Hadoop

First make sure HBase is installed on all the nodes and that your HBase version is compatible with your Hadoop version. For this document we are using Hadoop and HBase

Use:

    sudo apt-get install hbase\*

Start an HBase shell on the master, create and list a table, and verify that rows are being created and stored. Then change hbase-site.xml to point at HDFS and repeat creating and listing tables.
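The create/list check described above could be scripted roughly as follows. The table name, column family, and row are hypothetical, and the script only writes the HBase shell commands to a file; running them is left as a commented step because it needs a working HBase installation.

```shell
#!/bin/sh
# Write a short HBase shell script to a temporary file (or to $SCRIPT if set).
SCRIPT="${SCRIPT:-$(mktemp)}"

# 'smoketest' and 'cf' are illustrative names, not taken from the handout.
cat > "$SCRIPT" <<'EOF'
create 'smoketest', 'cf'
put 'smoketest', 'row1', 'cf:a', 'value1'
list
scan 'smoketest'
exit
EOF

echo "HBase shell commands written to $SCRIPT"
# On the master, with HBase installed and running, one could then run:
#   hbase shell "$SCRIPT"
```

If the scan shows row1, rows are being stored. Repeating the same script after switching hbase-site.xml to HDFS gives a like-for-like comparison, although the table name has to change or the table has to be dropped first.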


Masters: the address of the secondary namenode. The primary namenode runs on the node you run the scripts from.

Slaves: IP addresses of the data nodes.

Test by running:

    bin/hadoop fs -mkdir /temptest

hbase-site.xml (for distributed mode, change hbase.rootdir to an hdfs URL: hdfs://localhost:8020)

    <name>hbase.rootdir</name>
    <value>file:///home/ubuntu/testhbase</value>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
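The masters and slaves files described above are plain lists with one address per line. A minimal sketch follows, in which every address is a placeholder and the files are written to a temporary directory unless CONF_DIR is set:

```shell
#!/bin/sh
# Hypothetical addresses; replace them with the private IPs of your instances.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
SECONDARY_NN="10.0.0.10"
DATA_NODES="10.0.0.11 10.0.0.12"

# masters: the address of the secondary namenode.
echo "$SECONDARY_NN" > "$CONF_DIR/masters"

# slaves: one data node address per line.
: > "$CONF_DIR/slaves"
for node in $DATA_NODES; do
    echo "$node" >> "$CONF_DIR/slaves"
done

echo "masters:"; cat "$CONF_DIR/masters"
echo "slaves:";  cat "$CONF_DIR/slaves"
```

To also run a data node on the master, as this handout does, the master's address would be added to DATA_NODES.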

