Hadoop Training Hands-On Exercises
1. Getting started

Step 1: Download and install VMware Player
- Download VMware-player-5.0.1-894247.zip and unzip it on your Windows machine
- Run the .exe and install VMware Player

Step 2: Download and install the VMware image
- Download Hadoop Training Distribution.zip and unzip it on your Windows machine
- Click on centos-6.3-x86_64-server.vmx to start the virtual machine

Step 3: Log in and run a quick check
- Once the VM starts, log in with the following credentials:
  Username: training
  Password: training
- Quickly check that Eclipse and MySQL Workbench are installed
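A quick way to do that last check from a terminal is sketched below; this assumes both tools are on the PATH and that the binary names are eclipse and mysql-workbench (they may differ on your image):
$> which eclipse
$> which mysql-workbench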
2. Installing Hadoop in pseudo-distributed mode

Step 1: Run the following command to install Hadoop from the yum repository in pseudo-distributed mode (already done for you, please don't run this command)
sudo yum install hadoop-0.20-conf-pseudo

Step 2: Verify that the packages are installed properly
rpm -ql hadoop-0.20-conf-pseudo

Step 3: Format the namenode
sudo -u hdfs hdfs namenode -format

Step 4: Stop existing services (as Hadoop was already installed for you, some services might be running)
$ for service in /etc/init.d/hadoop*
> do
>   sudo $service stop
> done

Step 5: Start HDFS
$ for service in /etc/init.d/hadoop-hdfs-*
> do
>   sudo $service start
> done
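If you want to confirm what is actually running before and after Steps 4 and 5, most of these init scripts also accept a status argument; a hedged sketch:
$ for service in /etc/init.d/hadoop*
> do
>   sudo $service status
> done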
Step 6: Verify that HDFS has started properly (in the browser)
http://localhost:50070

Step 7: Create the /tmp directory
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Step 8: Create MapReduce-specific directories
sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 9: Verify the directory structure
$ sudo -u hdfs hadoop fs -ls -R /
The output should be:
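(Aside: if your hadoop fs client supports the -p flag, the nested directories in Step 8 can be created with a single command; this is a sketch assuming -p is available in your version:
$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
)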
drwxrwxrwt   - hdfs   supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup 0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup 0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

Step 10: Start MapReduce
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
>   sudo $service start
> done

Step 11: Verify that MapReduce has started properly (in the browser)
http://localhost:50030

Step 12: Verify that the installation went well by running a program

Step 12.1: Create a home directory on HDFS for the user
sudo -u hdfs hadoop fs -mkdir /user/training
sudo -u hdfs hadoop fs -chown training /user/training
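Besides the web UIs, you can also check which Hadoop daemons are running from the shell with jps (part of the JDK); run as root it should list the JVMs of all users, e.g. NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:
$ sudo jps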
Step 12.2: Make a directory in HDFS called input and copy some XML files into it by running the following commands
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r--   1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r--   1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

Step 12.3: Run an example Hadoop job that greps your input data with a regular expression
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

Step 12.4: After the job completes, you can find the output in the HDFS directory named output, because that is the output directory you passed to the job.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup 0 2009-08-18 18:38 /user/joe/output
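Note that the job will fail if the output directory already exists, so if you want to re-run Step 12.3, delete the previous output first, for example:
$ hadoop fs -rmr output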
Step 12.5: List the output files
$ hadoop fs -ls output
Found 3 items
drwxr-xr-x   - joe supergroup    0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r--   1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r--   1 joe supergroup    0 2009-02-25 10:33 /user/joe/output/_SUCCESS

Step 12.6: Read the output
$ hadoop fs -cat output/part-00000 | head
1 dfs.datanode.data.dir
1 dfs.namenode.checkpoint.dir
1 dfs.namenode.name.dir
1 dfs.replication
1 dfs.safemode.extension
1 dfs.safemode.min.datanodes
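If you prefer to inspect the result on the local filesystem, you can also pull the output file down with -get; the local path below is just an example:
$ hadoop fs -get output/part-00000 /tmp/grep-output.txt
$ head /tmp/grep-output.txt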
3. Accessing HDFS from the command line
This exercise is just to get you familiar with HDFS. Run the following commands:

Command 1: List the files in the /user/training directory
$> hadoop fs -ls

Command 2: List the files in the root directory
$> hadoop fs -ls /

Command 3: Push a file to HDFS
$> hadoop fs -put test.txt /user/training/test.txt

Command 4: View the contents of the file
$> hadoop fs -cat /user/training/test.txt

Command 5: Delete a file
$> hadoop fs -rmr /user/training/test.txt
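A few more standard hadoop fs subcommands are worth trying while you are here (run them before deleting test.txt in Command 5, or push the file again first):
Copy a file from HDFS to the local filesystem
$> hadoop fs -get /user/training/test.txt /tmp/test.txt
Show the space used under a directory
$> hadoop fs -du /user/training
Show the last kilobyte of a file
$> hadoop fs -tail /user/training/test.txt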
4. Running the WordCount MapReduce job

Step 1: Put the data in HDFS
hadoop fs -mkdir /user/training/wordcountinput
hadoop fs -put wordcount.txt /user/training/wordcountinput

Step 2: Create a new project in Eclipse called wordcount
1. cp -r /home/training/exercises/wordcount /home/training/workspace/wordcount
2. Open Eclipse -> New Project -> wordcount -> location /home/training/workspace
3. Right-click the wordcount project -> Properties -> Java Build Path -> Libraries -> Add External JARs -> select all jars from /usr/lib/hadoop and /usr/lib/hadoop-0.20-mapreduce -> OK
4. Make sure there are no remaining compilation errors

Step 3: Create a jar file
1. Right-click the project -> Export -> Java -> JAR -> select /home/training as the location -> make sure wordcount is checked -> Finish

Step 4: Run the jar file
hadoop jar wordcount.jar WordCount wordcountinput wordcountoutput
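Once the job finishes, you can inspect the word counts directly in HDFS; the part-* pattern below assumes the default output file naming:
hadoop fs -ls wordcountoutput
hadoop fs -cat wordcountoutput/part-* | head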
5. Mini project: Importing MySQL data using Sqoop and querying it using Hive

5.1 Setting up Sqoop

Step 1: Install Sqoop (already done for you, please don't run this command)
$> sudo yum install sqoop

Step 2: View the list of databases
$> sqoop list-databases \
   --connect jdbc:mysql://localhost/training_db \
   --username root --password root

Step 3: View the list of tables
$> sqoop list-tables \
   --connect jdbc:mysql://localhost/training_db \
   --username root --password root

Step 4: Import data to HDFS
$> sqoop import \
   --connect jdbc:mysql://localhost/training_db \
   --table user_log --fields-terminated-by '\t' \
   -m 1 --username root --password root
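By default Sqoop writes the import under your HDFS home directory in a directory named after the table, so you can sanity-check the result as follows (the part-m-00000 name assumes the single-mapper import requested with -m 1):
$> hadoop fs -ls user_log
$> hadoop fs -cat user_log/part-m-00000 | head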
5.2 Setting up Hive

Step 1: Install Hive (already done for you, please don't run this command)
$> sudo yum install hive
Then prepare the warehouse directories and permissions, and start the Hive shell:
$> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$> hadoop fs -chmod g+w /tmp
$> sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
$> sudo -u hdfs hadoop fs -chown -R training /user/hive/warehouse
$> sudo chmod 777 /var/lib/hive/metastore
$> hive
hive> show tables;

Step 2: Create the table
hive> CREATE TABLE user_log (country STRING, ip_address STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

Step 3: Load the data
hive> LOAD DATA INPATH "/user/training/user_log/part-m-00000" INTO TABLE user_log;

Step 4: Run the query
hive> SELECT country, COUNT(1) FROM user_log GROUP BY country;
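If you want the countries sorted by frequency, a variant of the Step 4 query can also be run non-interactively with hive -e; a sketch:
$> hive -e "SELECT country, COUNT(1) AS cnt FROM user_log GROUP BY country ORDER BY cnt DESC;"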
6. Setting up Flume

Step 1: Install Flume (already done for you, please don't run this command)
$> sudo yum install flume-ng
$> sudo -u hdfs hadoop fs -chmod 1777 /user/training

Step 2: Copy the configuration file
$> sudo cp /home/training/exercises/flume-config/flume.conf /usr/lib/flume-ng/conf

Step 3: Start the Flume agent
$> flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf --name agent -Dflume.root.logger=INFO,console

Step 4: Push the file (in a different terminal)
$> sudo cp /home/training/exercises/log.txt /home/training

Step 5: View the output
$> hadoop fs -ls logs
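The contents of the provided flume.conf are not reproduced here. Purely as an illustration of the general shape of a Flume NG agent definition (matching the agent name used with --name above), a minimal sketch with a spooling-directory source and an HDFS sink might look like the following; the directories and property values are assumptions, not the actual exercise configuration:
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1

# Watch a local directory for new files (illustrative; the real config
# evidently watches wherever Step 4 places log.txt)
agent.sources.src1.type = spooldir
agent.sources.src1.spoolDir = /home/training/flume-spool
agent.sources.src1.channels = ch1

# Buffer events in memory
agent.channels.ch1.type = memory

# Write events to HDFS as plain text
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /user/training/logs
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.channel = ch1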
7. Setting up a multi-node cluster

Step 1: To convert the pseudo-distributed mode to distributed mode, the first step is to stop the existing services (to be done on all nodes)
$> for service in /etc/init.d/hadoop*
> do
>   sudo $service stop
> done

Step 2: Create a new set of blank configuration files. The conf.empty directory contains blank files, so we will copy those to a new directory (to be done on all nodes)
$> sudo cp -r /etc/hadoop/conf.empty \
>  /etc/hadoop/conf.class

Step 3: Point the Hadoop configuration to the new configuration directory (to be done on all nodes)
$> sudo /usr/sbin/alternatives --install \
>  /etc/hadoop/conf hadoop-conf \
>  /etc/hadoop/conf.class 99

Step 4: Verify the alternatives setting (to be done on all nodes)
$> /usr/sbin/update-alternatives \
>  --display hadoop-conf

Step 5: Setting up the hosts (to be done on all nodes)
Step 5.1: Find the IP address of your machine
$> /sbin/ifconfig

Step 5.2: List all the IP addresses in your cluster setup, i.e. the ones that will belong to your cluster, and decide on a name for each one. In our example we are setting up a 3-node cluster, so we fetch the IP address of each node and name it namenode or datanode<n>. Update the /etc/hosts file with the IP addresses, so that /etc/hosts on each node looks something like this:
192.168.1.12 namenode
192.168.1.21 datanode1
192.168.1.22 datanode2

Step 5.3: Update the /etc/sysconfig/network file with the hostname
Open /etc/sysconfig/network on your local box and make sure that your hostname is namenode or datanode<n>. For example, if you have decided to be datanode1 (192.168.1.21), your entry should be
HOSTNAME=datanode1

Step 5.4: Restart your machine and try pinging the other machines
$> ping namenode

Step 6: Changing configuration files (to be done on all nodes)
The format for adding a configuration parameter is
<property>
  <name>property_name</name>
  <value>property_value</value>
</property>
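For example, with the hostnames chosen in Step 5, the fs.default.name entry from the table below would be added to core-site.xml like this (adjust the hostname to match your own /etc/hosts entries):
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020</value>
</property>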
Add the following configuration properties (name = value) to the following files:

Filename: /etc/hadoop/conf.class/core-site.xml
  fs.default.name = hdfs://<namenode>:8020

Filename: /etc/hadoop/conf.class/hdfs-site.xml
  dfs.name.dir     = /home/disk1/dfs/nn,/home/disk2/dfs/nn
  dfs.data.dir     = /home/disk1/dfs/dn,/home/disk2/dfs/dn
  dfs.http.address = namenode:50070

Filename: /etc/hadoop/conf.class/mapred-site.xml
  mapred.local.dir                   = /home/disk1/mapred/local,/home/disk2/mapred/local
  mapred.job.tracker                 = namenode:8021
  mapred.jobtracker.staging.root.dir = /user

Step 7: Create the necessary directories (to be done on all nodes)
$> sudo mkdir -p /home/disk1/dfs/nn
$> sudo mkdir -p /home/disk2/dfs/nn
$> sudo mkdir -p /home/disk1/dfs/dn
$> sudo mkdir -p /home/disk2/dfs/dn
$> sudo mkdir -p /home/disk1/mapred/local
$> sudo mkdir -p /home/disk2/mapred/local

Step 8: Manage permissions (to be done on all nodes)
$> sudo chown -R hdfs:hadoop /home/disk1/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk1/dfs/dn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/dn
$> sudo chown -R mapred:hadoop /home/disk1/mapred/local
$> sudo chown -R mapred:hadoop /home/disk2/mapred/local
Step 9: Reduce the Hadoop heap size (to be done on all nodes)
$> export HADOOP_HEAPSIZE=200

Step 10: Format the namenode (only on the namenode)
$> sudo -u hdfs hadoop namenode -format

Step 11: Start the HDFS processes
On the namenode:
$> sudo /etc/init.d/hadoop-hdfs-namenode start
$> sudo /etc/init.d/hadoop-hdfs-secondarynamenode start
On the datanodes:
$> sudo /etc/init.d/hadoop-hdfs-datanode start

Step 12: Create directories in HDFS (only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /user/training
$> sudo -u hdfs hadoop fs -chown training /user/training

Step 13: Create directories for MapReduce (only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /mapred/system
$> sudo -u hdfs hadoop fs -chown mapred:hadoop \
>  /mapred/system
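Before moving on, it helps to confirm that all datanodes have registered with the namenode. One way to check (run on the namenode) is the dfsadmin report, which lists the live datanodes:
$> sudo -u hdfs hdfs dfsadmin -report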
Step 14: Start the MapReduce processes
On the namenode:
$> sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start
On the slave nodes:
$> sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start

Step 15: Verify the cluster
Visit http://namenode:50070 and look at the number of live nodes.
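As an optional end-to-end smoke test, you can submit a small job against the new cluster, for example the pi estimator from the same examples jar used earlier (the jar path assumes the packaged install from the earlier sections):
$> hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 100
The MapReduce web UI at http://namenode:50030 should show the job while it runs.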