HADOOP CLUSTER SETUP GUIDE:




Passwordless SSH Sessions:

Before we start the installation, we have to ensure that passwordless SSH login is possible to any of the Linux machines of CS120. To set this up, log in to your account and run the following commands (for further background, see: https://hadoopabcd.wordpress.com/2015/05/16/linuxpassword-less-ssh-logins/ ):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Setting up Config Files:

Step 1. Create a new folder for your configuration files:

mkdir ~/hadoopconf

Step 2. Create core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Sample versions of these files can be downloaded and extracted as follows:

wget http://www.cs.colostate.edu/~cs435/pa1/conf.tar
tar xvf conf.tar

Copy the configuration files into your configuration directory:

mv * ~/hadoopconf

In core-site.xml, mapred-site.xml, and hdfs-site.xml, anything marked HOST or PORT should be replaced with actual hostnames and port numbers of your choice. The list of hostnames in CSB120 is available at: http://www.cs.colostate.edu/~info/machines

Make sure that you select at least 10 nodes (CS120 machines only) for the slaves file described below in Step 4. You also need to select a set of available ports from the non-privileged port range. Do not select any ports between 56,000 and 57,000; these are dedicated to the shared HDFS cluster. To reduce possible port conflicts, you will each be assigned a port range; the assignments are listed at:

http://www.cs.colostate.edu/~cs435/pa1/pa1-portassignment.pdf
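Before moving on to the Hadoop-specific settings, it can help to confirm that passwordless login really works and to script the HOST/PORT substitution instead of editing each file by hand. The sketch below is optional: the machine name and port are hypothetical placeholders (not assigned values), and each configuration file may need different hosts and ports, so review the files afterwards.

ssh -o BatchMode=yes some-cs120-machine hostname    # should print the remote hostname without prompting for a password
sed -i 's/HOST/some-cs120-machine/g; s/PORT/30123/g' ~/hadoopconf/core-site.xml    # example substitution for one file only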

Step 3. Set the environment variables in your .bashrc file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=<path to your config directory mentioned in Step 2>
export YARN_CONF_DIR=${HADOOP_CONF_DIR}
export HADOOP_LOG_DIR=/tmp/${USER}/hadoop-logs
export YARN_LOG_DIR=/tmp/${USER}/yarn-logs
export JAVA_HOME=/usr/local/jdk1.8.0_51
export HADOOP_OPTS="-Dhadoop.tmp.dir=/s/${HOSTNAME}/a/tmp/hadoop-${USER}"

Step 4. Create the masters and slaves files. The masters file should contain the name of the machine you choose as your secondary namenode. The slaves file contains the list of all your worker nodes; Programming Assignment 1 requires at least 10 such nodes (a minimal example is sketched below, after Starting HDFS).

Running HDFS

a. First Time Configuration

Step 1. Log into your namenode. You will have specified your namenode in core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://host:port</value>
</property>

Step 2. Format your namenode:

$HADOOP_HOME/bin/hdfs namenode -format

b. Starting HDFS

Run the start script:

$HADOOP_HOME/sbin/start-dfs.sh
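As an illustration of Step 4 and a quick way to confirm that HDFS actually came up, the sketch below creates minimal masters and slaves files and then checks the daemons after start-dfs.sh. It assumes, as is conventional, that both files live in your configuration directory (~/hadoopconf); the machine names are hypothetical placeholders, not real CS120 hostnames, and the checks themselves are optional. jps comes with the JDK and dfsadmin is a standard HDFS command.

echo "machine-01" > ~/hadoopconf/masters                     # hypothetical secondary namenode
printf "%s\n" machine-02 machine-03 machine-04 machine-05 machine-06 machine-07 machine-08 machine-09 machine-10 machine-11 > ~/hadoopconf/slaves    # at least 10 hypothetical workers

jps                                                          # on the namenode: a NameNode process should be listed
$HADOOP_HOME/bin/hdfs dfsadmin -report | head -n 20          # the number of live datanodes should match your slaves file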

Step 3. Check the web portal to make sure that HDFS is running: open in your browser the URL http://<namenode>:<port>, where the port is the one given for the property dfs.namenode.http-address in hdfs-site.xml.

c. Stopping HDFS

Step 1. Log into your namenode.

Step 2. Run the stop script:

$HADOOP_HOME/sbin/stop-dfs.sh

Running Yarn

a. Starting Yarn

Step 1. Log into your resource manager. (The Resource Manager is the node you set as HOST in yarn-site.xml.)

Step 2. Run the start script:

$HADOOP_HOME/sbin/start-yarn.sh

Step 3. Check the web portal to make sure that Yarn is running: http://<resourcemanager>:<port given for the property yarn.resourcemanager.webapp.address>

b. Stopping Yarn

Step 1. Log into your resource manager.

Step 2. Run the stop script:

$HADOOP_HOME/sbin/stop-yarn.sh

Running a Job Locally

a. Setup

Step 1. Open mapred-site.xml.

Step 2. Change the value of the property mapreduce.framework.name to local. To be safe, restart HDFS and ensure that Yarn is stopped.
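If the web portals are inconvenient, the same information can be checked from the command line. The sketch below is optional; replace the bracketed placeholders with the hosts and ports you configured. yarn node -list is a standard YARN command, and curl simply confirms that the portals respond.

$HADOOP_HOME/bin/yarn node -list                                                 # after start-yarn.sh: each worker should appear with state RUNNING
curl -sf http://<namenode>:<port>/ > /dev/null && echo "HDFS web UI reachable"
curl -sf http://<resourcemanager>:<port>/cluster > /dev/null && echo "YARN web UI reachable"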

b. Running the Job

From any node:

$HADOOP_HOME/bin/hadoop jar your.jar yourclass args

Running a Job in Yarn

a. Setup

Step 1. Open mapred-site.xml.

Step 2. Change the value of the property mapreduce.framework.name to yarn. To be safe, restart HDFS and Yarn.

b. Running the Job

From any node:

$HADOOP_HOME/bin/hadoop jar your.jar yourclass args

Accessing Shared Data Sets

We are using ViewFS-based federation for sharing data sets due to space constraints. Your MapReduce programs will deal with two name nodes in this setup:

1. The name node of your own HDFS cluster. This name node is used for writing the output generated by your program. We will refer to this as the local HDFS.

2. The name node of the shared HDFS cluster. This name node hosts the input data for your programs and is made read only. You should not attempt to write to this cluster; it is only for staging input data.

Datasets are hosted in the /data directory of the shared file system. You will get r+x permission for this directory, which allows your MapReduce program to traverse inside it and read files.

ViewFS-based federation is conceptually similar to mounts in Linux file systems. Different directories visible to the client are mounted to different locations in the namespaces of the name nodes. This provides a seamless view to the client, as if it were accessing a single file system. There are three mounts created in the core-site.xml at the client's end:

Mount Point    Name Node    Directory in Name Node
/home          Local        /
/tmp           Local        /tmp
/data          Shared       /data
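Once the client configuration described under "Running a MapReduce Client" below is in place, you can explore these mount points directly from the command line. This is only an illustrative sketch of what the listings should roughly show, not a required step.

$HADOOP_HOME/bin/hdfs dfs -ls /          # the three mount points /home, /tmp and /data should be visible
$HADOOP_HOME/bin/hdfs dfs -ls /data      # read-only data sets staged on the shared cluster
$HADOOP_HOME/bin/hdfs dfs -ls /home      # the root of your local HDFS cluster, where output is written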

[Figure: graphical representation of these mount points.]

For example, if you want to read the Gutenberg project data, the input file path should be provided as /data/gutenberg, which points to /data/gutenberg in the shared HDFS. If you write the output to /home/output-1, it will be written to /output-1 of the local HDFS.

You can also use the /nobackup directory. The /nobackup directory is not deleted every week; however, if the cluster goes down, you will not be able to recover your files in this directory.

a. Running a MapReduce Client

Usually a client uses the same Hadoop configuration that was used for creating the cluster when running MapReduce programs. In this setup, the Hadoop configuration for a client is different, mainly because of the mount points discussed above. In addition, your MapReduce program will run on the MapReduce runtime based on the shared HDFS. In other words, your local HDFS cluster is used only for storing the output: you read input and run your MapReduce programs on the shared Hadoop cluster. This ensures data locality for the input data.
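One way to see the difference between the two configurations, assuming your cluster configuration lives in ~/hadoopconf and the client configuration has been extracted as described in the next section, is to compare the default file system each one resolves to. The paths below are placeholders, and the exact URI on the client side depends on the provided configuration, but it should be a viewfs: URI rather than a plain hdfs: one.

HADOOP_CONF_DIR=~/hadoopconf $HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS
    # hdfs://<your namenode>:<your port>  (the local cluster you started earlier)
HADOOP_CONF_DIR=/path/to/extracted-client-config $HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS
    # a viewfs: URI covering the mount table above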

Follow these steps when running a MapReduce program:

1. Download client-config.tar.gz from http://www.cs.colostate.edu/~cs435/pa1/sharedfileclient-config.tar

2. Extract the tarball.

3. The extracted configuration directory contains 3 files. You only need to update the core-site.xml file; the other files should not be changed. Update the NAMENODE and PORT information in core-site.xml with the hostname and port of the name node of your local cluster. You can find this information in the core-site.xml of your local HDFS cluster configuration. There will be an entry with honolulu as the name node hostname in the core-site.xml file. Do not change this entry; it points to the name node of the shared cluster.

4. Now you need to update the environment variable HADOOP_CONF_DIR. This has already been set in the .*rc file of your shell (.bashrc for Bash and .cshrc for Tcsh). Do not update the environment variable set in the .*rc file, because you are running your local Hadoop cluster based on that parameter. Instead, just overwrite this environment variable for the current shell instance:

export HADOOP_CONF_DIR=/path/to/extracted-client-config

The overwritten value is valid only for that particular instance of the shell; you need to overwrite this environment variable every time you spawn a new shell to run a MapReduce program.

5. Now you can run your Hadoop program as usual. Make sure to specify the paths using the mount points. For example, suppose that you need to run your data analysis program, the output should be written to the /gutenberg/output directory of your local cluster, the jar is in your home directory, and the job takes only two parameters, the input path and the output path respectively. The command to run such a program is:

$HADOOP_HOME/bin/hadoop jar ~/analysis.jar com.user.mainjob /data/gutenberg /home/gutenberg/output
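Putting the steps above together, a typical session might look like the sketch below. It reuses the same illustrative jar, class name, and paths from the example above; the hdfs dfs commands at the end are standard and only inspect the result (the part-r-* file names are the usual MapReduce output naming, shown here as an assumption about the job).

export HADOOP_CONF_DIR=/path/to/extracted-client-config         # valid for this shell instance only
$HADOOP_HOME/bin/hadoop jar ~/analysis.jar com.user.mainjob /data/gutenberg /home/gutenberg/output
$HADOOP_HOME/bin/hdfs dfs -ls /home/gutenberg/output            # output files (typically part-r-*) appear here on success
$HADOOP_HOME/bin/hdfs dfs -cat /home/gutenberg/output/part-r-00000 | head    # inspect a sample of the output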