Hadoop Training Hands-On Exercise




1. Getting started:

Step 1: Download and install the VMware Player
- Download VMware-player-5.0.1-894247.zip and unzip it on your Windows machine
- Run the installer executable to install VMware Player

Step 2: Download and install the VMware image
- Download the Hadoop Training - Distribution.zip and unzip it on your Windows machine
- Double-click centos-6.3-x86_64-server.vmx to start the virtual machine

Step 3: Log in and run a quick check
- Once the VM starts, log in with the following credentials:
  Username: training
  Password: training
- Quickly check that Eclipse and MySQL Workbench are installed

2. Installing Hadoop in pseudo-distributed mode:

Step 1: Run the following command to install Hadoop from the yum repository in pseudo-distributed mode (already done for you, please don't run this command)

sudo yum install hadoop-0.20-conf-pseudo

Step 2: Verify that the packages are installed properly

rpm -ql hadoop-0.20-conf-pseudo

Step 3: Format the namenode

sudo -u hdfs hdfs namenode -format

Step 4: Stop any existing services (as Hadoop was already installed for you, some services may be running)

$ for service in /etc/init.d/hadoop*
> do
>   sudo $service stop
> done

Step 5: Start HDFS

$ for service in /etc/init.d/hadoop-hdfs-*
> do
>   sudo $service start
> done
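Before moving on, you can also confirm from the shell that the HDFS daemons actually came up by listing the running Java processes (a quick sanity check; it assumes the JDK's jps tool is available on the VM, and sudo is needed because the daemons run under their own service users):

$ sudo jps

You should see NameNode, SecondaryNameNode and DataNode among the processes listed.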

Step 6: Verify that HDFS has started properly (in the browser)

http://localhost:50070

Step 7: Create the /tmp directory

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Step 8: Create the MapReduce-specific directories

sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 9: Verify the directory structure

$ sudo -u hdfs hadoop fs -ls -R /

The output should be:

drwxrwxrwt - hdfs   supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var/lib
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

Step 10: Start MapReduce

$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
>   sudo $service start
> done

Step 11: Verify that MapReduce has started properly (in the browser)

http://localhost:50030

Step 12: Verify that the installation went well by running a program

Step 12.1: Create a home directory on HDFS for the user

sudo -u hdfs hadoop fs -mkdir /user/training
sudo -u hdfs hadoop fs -chown training /user/training

Step 12.2: Make a directory in HDFS called input and copy some XML files into it by running the following commands

$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

Step 12.3: Run an example Hadoop job that greps your input data with a regular expression.

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

Step 12.4: After the job completes, you can find the output in the HDFS directory named output, because that is the output directory you specified to Hadoop.

$ hadoop fs -ls
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output

Step 12.5: List the output files

$ hadoop fs -ls output
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output/_SUCCESS

Step 12.6: Read the output

$ hadoop fs -cat output/part-00000 | head
1 dfs.datanode.data.dir
1 dfs.namenode.checkpoint.dir
1 dfs.namenode.name.dir
1 dfs.replication
1 dfs.safemode.extension
1 dfs.safemode.min.datanodes

3. Accessing HDFS from the command line:

This exercise is just to get you familiar with HDFS. Run the following commands:

Command 1: List the files in the /user/training directory
$> hadoop fs -ls

Command 2: List the files in the root directory
$> hadoop fs -ls /

Command 3: Push a file to HDFS
$> hadoop fs -put test.txt /user/training/test.txt

Command 4: View the contents of the file
$> hadoop fs -cat /user/training/test.txt

Command 5: Delete a file
$> hadoop fs -rmr /user/training/test.txt

4. Running the WordCount MapReduce job

Step 1: Put the data in HDFS

hadoop fs -mkdir /user/training/wordcountinput
hadoop fs -put wordcount.txt /user/training/wordcountinput

Step 2: Create a new project in Eclipse called wordcount

1. cp -r /home/training/exercises/wordcount /home/training/workspace/wordcount
2. Open Eclipse -> New Project -> wordcount -> location /home/training/workspace
3. Right-click the wordcount project -> Properties -> Java Build Path -> Libraries -> Add External Jars -> select all jars from /usr/lib/hadoop and /usr/lib/hadoop-0.20-mapreduce -> OK
4. Make sure that there are no compilation errors left

Step 3: Create a jar file

1. Right-click the project -> Export -> Java -> JAR -> select the location as /home/training -> make sure wordcount is checked -> Finish

Step 4: Run the jar file

hadoop jar wordcount.jar WordCount wordcountinput wordcountoutput
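The Java source for this job is already provided under /home/training/exercises/wordcount, so there is nothing to write; purely for orientation, a minimal WordCount class of the kind built here typically looks like the sketch below (written against the org.apache.hadoop.mapreduce API; the provided sources may differ in detail):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the per-word counts produced by the mappers
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Job.getInstance(conf) on newer APIs
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // safe as a combiner: summing is associative
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // wordcountinput
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // wordcountoutput
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The main method matches the invocation in Step 4: args[0] is the wordcountinput directory and args[1] is the (not yet existing) wordcountoutput directory.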

5. Mini Project: Importing MySQL Data Using Sqoop and Querying It Using Hive

5.1 Setting up Sqoop

Step 1: Install Sqoop (already done for you, please don't run this command)
$> sudo yum install sqoop

Step 2: View the list of databases
$> sqoop list-databases \
--connect jdbc:mysql://localhost/training_db \
--username root --password root

Step 3: View the list of tables
$> sqoop list-tables \
--connect jdbc:mysql://localhost/training_db \
--username root --password root

Step 4: Import data to HDFS
$> sqoop import \
--connect jdbc:mysql://localhost/training_db \
--table user_log --fields-terminated-by '\t' \
-m 1 --username root --password root
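Because the import runs with a single mapper (-m 1), it writes one delimited file under the table's directory in your HDFS home; a quick way to confirm the import worked is to list and peek at it (this part-m-00000 file is exactly what the Hive LOAD in the next section expects):

$> hadoop fs -ls /user/training/user_log
$> hadoop fs -cat /user/training/user_log/part-m-00000 | head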

5.2 Setting up Hive

Step 1: Install Hive (already done for you, don't run this command)
$> sudo yum install hive

Then prepare the warehouse directories and start Hive:
$> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$> hadoop fs -chmod g+w /tmp
$> sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
$> sudo -u hdfs hadoop fs -chown -R training /user/hive/warehouse
$> sudo chmod 777 /var/lib/hive/metastore
$> hive
hive> show tables;

Step 2: Create the table
hive> create table user_log (country STRING, ip_address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Step 3: Load the data
hive> LOAD DATA INPATH "/user/training/user_log/part-m-00000" INTO TABLE user_log;

Step 4: Run the query
hive> select country, count(1) from user_log group by country;

6. Setting up Flume

Step 1: Install Flume (already done for you, please don't run this command)
$> sudo yum install flume-ng

Then open up the HDFS target directory:
$> sudo -u hdfs hadoop fs -chmod 1777 /user/training

Step 2: Copy the configuration file
$> sudo cp /home/training/exercises/flume-config/flume.conf /usr/lib/flume-ng/conf

Step 3: Start the Flume agent
$> flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf --name agent -Dflume.root.logger=INFO,console

Step 4: In a different terminal, push the file
$> sudo cp /home/training/exercises/log.txt /home/training

Step 5: View the output
$> hadoop fs -ls logs
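The flume.conf shipped with the exercise is authoritative; purely for orientation, a configuration producing the behavior implied by these steps (an agent named agent that watches a local directory and writes into a logs directory in HDFS) could look roughly like this hypothetical sketch, though the actual source and sink types in the provided file may differ:

# Hypothetical flume.conf sketch; the file in
# /home/training/exercises/flume-config/flume.conf is the one to use.
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1

# Source: watch a local directory for new files
# (Step 4 copies log.txt into /home/training)
agent.sources.src1.type = spooldir
agent.sources.src1.spoolDir = /home/training
agent.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 1000

# Sink: write events into HDFS under /user/training/logs
# (which is what "hadoop fs -ls logs" inspects in Step 5)
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /user/training/logs
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.channel = ch1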

7. Setting up a multi-node cluster

Step 1: To convert the pseudo-distributed mode to distributed mode, the first step is to stop the existing services (to be done on all nodes)

$> for service in /etc/init.d/hadoop*
> do
>   sudo $service stop
> done

Step 2: Create a new set of blank configuration files. The conf.empty directory contains blank files, so we will copy those to a new directory (to be done on all nodes)

$> sudo cp -r /etc/hadoop/conf.empty \
> /etc/hadoop/conf.class

Step 3: Point the Hadoop configuration to the new configuration (to be done on all nodes)

$> sudo /usr/sbin/alternatives --install \
> /etc/hadoop/conf hadoop-conf \
> /etc/hadoop/conf.class 99

Step 4: Verify the alternatives (to be done on all nodes)

$> /usr/sbin/update-alternatives \
> --display hadoop-conf

Step 5: Setting up the hosts (to be done on all nodes)

Step 5.1: Find the IP address of your machine
$> /sbin/ifconfig

Step 5.2: List all the IP addresses in your cluster setup, i.e. the ones that will belong to your cluster, and decide a name for each one. In our example we are setting up a 3-node cluster, so we fetch the IP address of each node and name it namenode or datanode<n>. Update the /etc/hosts file with the IP addresses as shown, so that /etc/hosts on each node looks something like this:

192.168.1.12 namenode
192.168.1.21 datanode1
192.168.1.22 datanode2

Step 5.3: Update the /etc/sysconfig/network file with the hostname
Open /etc/sysconfig/network on your local box and make sure that your hostname is namenode or datanode<n>. For example, assuming you have decided to be datanode1, i.e. 192.168.1.21, your entry should be

HOSTNAME=datanode1

Step 5.4: Restart your machine and try pinging the other machines
ping namenode

Step 6: Changing the configuration files (to be done on all nodes)
The format for adding a configuration parameter is

<property>
  <name>property_name</name>
  <value>property_value</value>
</property>
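Written out concretely, the first entry from the table below (fs.default.name in core-site.xml) would look like this; the other files follow the same pattern with their respective names and values:

<?xml version="1.0"?>
<!-- /etc/hadoop/conf.class/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>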

Add the following configuration parameters to the following files:

Filename: /etc/hadoop/conf.class/core-site.xml
  fs.default.name = hdfs://namenode:8020

Filename: /etc/hadoop/conf.class/hdfs-site.xml
  dfs.name.dir = /home/disk1/dfs/nn,/home/disk2/dfs/nn
  dfs.data.dir = /home/disk1/dfs/dn,/home/disk2/dfs/dn
  dfs.http.address = namenode:50070

Filename: /etc/hadoop/conf.class/mapred-site.xml
  mapred.local.dir = /home/disk1/mapred/local,/home/disk2/mapred/local
  mapred.job.tracker = namenode:8021
  mapred.jobtracker.staging.root.dir = /user

Step 7: Create the necessary directories (to be done on all nodes)

$> sudo mkdir -p /home/disk1/dfs/nn
$> sudo mkdir -p /home/disk2/dfs/nn
$> sudo mkdir -p /home/disk1/dfs/dn
$> sudo mkdir -p /home/disk2/dfs/dn
$> sudo mkdir -p /home/disk1/mapred/local
$> sudo mkdir -p /home/disk2/mapred/local

Step 8: Manage permissions (to be done on all nodes)

$> sudo chown -R hdfs:hadoop /home/disk1/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk1/dfs/dn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/dn
$> sudo chown -R mapred:hadoop /home/disk1/mapred/local
$> sudo chown -R mapred:hadoop /home/disk2/mapred/local

Step 9: Reduce the Hadoop heap size (to be done on all nodes)
$> export HADOOP_HEAPSIZE=200

Step 10: Format the namenode (only on the namenode)
$> sudo -u hdfs hadoop namenode -format

Step 11: Start the HDFS processes

On the namenode:
$> sudo /etc/init.d/hadoop-hdfs-namenode start
$> sudo /etc/init.d/hadoop-hdfs-secondarynamenode start

On the datanodes:
$> sudo /etc/init.d/hadoop-hdfs-datanode start

Step 12: Create directories in HDFS (only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /user/training
$> sudo -u hdfs hadoop fs -chown training /user/training

Step 13: Create directories for MapReduce (only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /mapred/system
$> sudo -u hdfs hadoop fs -chown mapred:hadoop \
> /mapred/system

Step 14: Start the MapReduce processes

On the namenode:
$> sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start

On the slave nodes:
$> sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start

Step 15: Verify the cluster
Visit http://namenode:50070 and look at the number of live nodes.
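Once the JobTracker and TaskTrackers are running, a simple end-to-end smoke test is to submit one of the bundled example jobs (the same hadoop-examples.jar used in section 2; the pi estimator is convenient because it needs no input data):

$> hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 100

While it runs, the job should appear on the JobTracker web UI at http://namenode:50030.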