Hadoop 2.6.0 Setup Walkthrough



This document provides information about working with Hadoop 2.6.0.

Contents

1 Setting Up Configuration Files
2 Setting Up The Environment
3 Additional Notes
4 Selecting Ports
5 Running HDFS
  5.1 First Time Configuration
  5.2 Starting HDFS
  5.3 Stopping HDFS
6 Running Yarn
  6.1 Starting Yarn
  6.2 Stopping Yarn
7 Running a Job Locally
  7.1 Setup
  7.2 Running the Job
8 Running a Job in Yarn
  8.1 Setup
  8.2 Running the Job
9 Accessing Shared Data Sets

1 Setting Up Configuration Files

Create a new folder for your configuration files:

    mkdir ~/hadoopconf

Download core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml from the course wiki (http://www.cs.colostate.edu/~cs455/wiki/doku.php?id=tools:hw3).

Move the configuration files into your configuration folder:

    mv *-site.xml ~/hadoopconf

Modify your configuration files: every placeholder host or port must be replaced with an actual hostname or port number. Wherever hostnames or ports have to match across different files, this is noted directly above each file.

Create a slaves file. It should contain the list of all your worker nodes; make sure you have at least 10 nodes.

2 Setting Up The Environment

Determine which shell you are using:

    echo $SHELL

For tcsh: copy the environment variables for tcsh (from the course wiki) into your ~/.cshrc file:

    setenv HADOOP_HOME /usr/local/hadoop
    setenv HADOOP_COMMON_HOME ${HADOOP_HOME}
    setenv HADOOP_HDFS_HOME ${HADOOP_HOME}
    setenv HADOOP_MAPRED_HOME ${HADOOP_HOME}
    setenv YARN_HOME ${HADOOP_HOME}
    setenv HADOOP_CONF_DIR <path to your conf dir>
    setenv YARN_CONF_DIR ${HADOOP_CONF_DIR}
    setenv HADOOP_LOG_DIR /tmp/${user}/hadoop-logs
    setenv YARN_LOG_DIR /tmp/${user}/yarn-logs
    setenv JAVA_HOME /usr/local/jdk1.8.0_51
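If you are using tcsh, you can reload the shell configuration and sanity-check the variables before moving on. A quick check, assuming the values above with HADOOP_CONF_DIR already pointed at the hadoopconf directory:

    source ~/.cshrc
    echo $HADOOP_CONF_DIR
    $HADOOP_HOME/bin/hadoop version

The last command prints the Hadoop release and build information, which confirms that HADOOP_HOME resolves to a working installation.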

For bash: copy the environment variables for Bash (from the course wiki) into your ~/.bashrc file:

    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_COMMON_HOME=${HADOOP_HOME}
    export HADOOP_HDFS_HOME=${HADOOP_HOME}
    export HADOOP_MAPRED_HOME=${HADOOP_HOME}
    export YARN_HOME=${HADOOP_HOME}
    export HADOOP_CONF_DIR=<path to your conf dir>
    export YARN_CONF_DIR=${HADOOP_CONF_DIR}
    export HADOOP_LOG_DIR=/tmp/${USER}/hadoop-logs
    export YARN_LOG_DIR=/tmp/${USER}/yarn-logs
    export JAVA_HOME=/usr/local/jdk1.8.0_51

Make sure to update HADOOP_CONF_DIR so that it points to the hadoopconf directory created above.

Close any open terminals and resume with a fresh terminal, or source the .*rc file.

Make sure to use JDK 1.8.0_51 (instead of 1.8.0_60) to avoid containers getting killed due to virtual memory issues.

3 Additional Notes

For additional help, visit http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

4 Selecting Ports

You need to select a set of available ports from the non-privileged port range. Do not select any ports between 56,000 and 57,000; these are dedicated to the shared HDFS cluster.
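Before filling in the configuration files, it can help to confirm that a candidate port is not already in use on the machine where the corresponding daemon will run. One rough check (9123 is just a hypothetical candidate; any unused non-privileged port outside the reserved 56,000-57,000 range will do):

    netstat -tln | grep 9123

If the command prints nothing, no process is currently listening on that port on this machine.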

5 Running HDFS

5.1 First Time Configuration

Log into your namenode. You will have specified which machine this is in your core-site.xml file:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://host:port</value>
    </property>

Format your namenode:

    $HADOOP_HOME/bin/hdfs namenode -format

5.2 Starting HDFS

Log into your namenode and run the start script:

    $HADOOP_HOME/sbin/start-dfs.sh

Check the web portal to make sure that HDFS is running:

    http://<namenode>:<port given for the property dfs.namenode.http-address>

5.3 Stopping HDFS

Log into your namenode and run the stop script:

    $HADOOP_HOME/sbin/stop-dfs.sh
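Besides the web portal, you can confirm from the command line that the daemons came up and that the file system is usable. A quick sanity check (jps ships with the JDK; /user/$USER is just an illustrative path):

    jps                                       # namenode typically shows NameNode (and SecondaryNameNode); workers show DataNode
    $HADOOP_HOME/bin/hdfs dfsadmin -report    # lists the live datanodes and their capacity
    $HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/$USER
    $HADOOP_HOME/bin/hdfs dfs -ls /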

6 Running Yarn

6.1 Starting Yarn

Log into your resource manager and run the start script:

    $HADOOP_HOME/sbin/start-yarn.sh

Check the web portal to make sure Yarn is running:

    http://<resourcemanager>:<port given for the property yarn.resourcemanager.webapp.address>

6.2 Stopping Yarn

Log into your resource manager and run the stop script:

    $HADOOP_HOME/sbin/stop-yarn.sh

7 Running a Job Locally

7.1 Setup

Open mapred-site.xml and change the value of the property mapreduce.framework.name to local. To be safe, restart HDFS and ensure Yarn is stopped.

7.2 Running the Job

From any node:

    $HADOOP_HOME/bin/hadoop jar your.jar yourclass args
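As a smoke test of the local setup, you could run the word-count example that ships with Hadoop. A sketch, assuming a stock Hadoop 2.6.0 layout under $HADOOP_HOME (the /wordcount-in and /wordcount-out paths are hypothetical, and the output directory must not already exist):

    $HADOOP_HOME/bin/hdfs dfs -mkdir -p /wordcount-in
    $HADOOP_HOME/bin/hdfs dfs -put sample.txt /wordcount-in
    $HADOOP_HOME/bin/hadoop jar \
        $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
        wordcount /wordcount-in /wordcount-out
    $HADOOP_HOME/bin/hdfs dfs -cat /wordcount-out/part-r-00000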

8 Running a Job in Yarn

8.1 Setup

Open mapred-site.xml and change the value of the property mapreduce.framework.name to yarn. To be safe, restart HDFS and Yarn.

8.2 Running the Job

From any node:

    $HADOOP_HOME/bin/hadoop jar your.jar yourclass args
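While the job is running you can also track it from the command line, in addition to the resource manager web portal. For example (the application id is printed when the job is submitted; fetching logs this way requires log aggregation to be enabled in yarn-site.xml):

    $HADOOP_HOME/bin/yarn application -list
    $HADOOP_HOME/bin/yarn logs -applicationId <application id>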

9 Accessing Shared Data Sets

We are using ViewFS-based federation for sharing data sets due to space constraints. Your MapReduce programs will deal with two name nodes in this setup:

1. The name node of your own HDFS cluster. This name node is used for writing the output generated by your program. We will refer to this as the local HDFS.

2. The name node of the shared HDFS cluster. This name node hosts the input data sets and is read-only. You should not attempt to write to this cluster; it is only for staging input data.

Data sets are hosted in the /data directory of the shared file system. You will get r+x permission for this directory, which allows your MapReduce program to traverse the directory and read files.

ViewFS-based federation is conceptually similar to mounts in Linux file systems. Different directories visible to the client are mounted to different locations in the namespaces of the name nodes. This provides a seamless view to the client, as if it were accessing a single file system. There are three mounts created in the core-site.xml on the client's end:

    Mount Point    Name Node    Directory in Name Node
    /home          Local        /
    /tmp           Local        /tmp
    /data          Shared       /data

For example, if you want to read the main data set, the input file path should be given as /data/main, which points to /data/main in the shared HDFS. If you write the output to /home/output-1, it will be written to /output-1 of the local HDFS.
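Once the client configuration described in the next section is in place, these mount points behave like ordinary directories from the client's point of view. For example (assuming HADOOP_CONF_DIR points at the extracted client configuration):

    $HADOOP_HOME/bin/hdfs dfs -ls /data    # read-only data sets on the shared cluster
    $HADOOP_HOME/bin/hdfs dfs -ls /home    # the root of your local HDFS cluster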

Running a MapReduce Client

Usually a client uses the same Hadoop configuration that was used for creating the cluster when running MapReduce programs. In this setup the Hadoop configuration for a client is different, mainly because of the mount points discussed above. Also, your MapReduce program will run on the MapReduce runtime backed by the shared HDFS. In other words, your local HDFS cluster is used only for storing the output; you will read input and run your MapReduce programs on the shared Hadoop cluster. This ensures data locality for the input data. Follow these steps when running a MapReduce program.

1. Download client-config.tar.gz from http://www.cs.colostate.edu/~cs555/clientconfig.tar.gz.

2. Extract the tarball.

3. The configuration directory contains 3 files. You only need to update the core-site.xml file; the other files should not be changed. Update the NAMENODE_HOST and NAMENODE_PORT entries in core-site.xml with the hostname and port of the name node of your local cluster. You can find this information in the core-site.xml of your local HDFS cluster configuration. There will be an entry with augusta as the name node hostname in the core-site.xml file. Do not change this entry; it points to the name node of the shared cluster.

4. Now you need to update the environment variable HADOOP_CONF_DIR. This has already been set in the .*rc file of your shell (.bashrc for Bash, .cshrc for tcsh). Do not change the value set in the .*rc file, because your local Hadoop cluster runs based on that setting. Instead, just overwrite the environment variable for the current shell instance.

For tcsh:

    setenv HADOOP_CONF_DIR /path/to/extracted-client-config

For Bash:

    export HADOOP_CONF_DIR=/path/to/extracted-client-config

The overwritten value is valid only for that particular instance of the shell. You need to overwrite this environment variable every time you spawn a new shell to run a MapReduce program.

5. Now you can run your Hadoop program as usual. Make sure to specify paths using the mount points. For example, suppose you need to run your airline data analysis program, with the output written to the /airline/output directory of your local cluster, and assume the job takes only two parameters, the input path and the output path respectively:

    $HADOOP_HOME/bin/hadoop jar airline-analysis.jar cs555.hadoop.airline.analysisjob /data/main /home/airline/output
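After the job finishes you can inspect the output through the same mount points, still using the client configuration (the part-r-00000 file name is the MapReduce default and may differ depending on your reducer setup):

    $HADOOP_HOME/bin/hdfs dfs -ls /home/airline/output
    $HADOOP_HOME/bin/hdfs dfs -cat /home/airline/output/part-r-00000 | head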