
Lab 9: Hadoop Development

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.

Introduction

Hadoop can be run in one of three modes:

Standalone (or local) mode
No daemons are running; everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them. It is not necessarily adequate for running other applications we will learn about later, such as Hive and Pig.

Pseudo-distributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale. Running in this mode is recommended prior to running on a real cluster.

Fully distributed mode
The Hadoop daemons run on a cluster of machines. More details are required to run in fully distributed mode than are provided in this lab.

Platforms: Hadoop is written in Java, so you will need Java version 6 or later. Hadoop runs on Unix and Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development.

Note: Windows is only supported as a development environment and also requires the dreaded Cygwin to run. During the Cygwin installation process, you need to include the openssh package if you plan to run Hadoop in pseudo-distributed mode. Getting the permissions correct under Cygwin is difficult; you should be OK if you only want to use it for standalone mode. In this lab, I will use Ubuntu. Good news: I did get everything running on Ubuntu!
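Since Hadoop needs Java 6 or later, it is worth confirming your Java version before going any further. A minimal sketch, assuming a version string of the form 1.x.y (the `java_minor` helper is my own, not part of the lab):

```shell
# Extract the minor number from a "1.6.0_26"-style Java version string
# and check it against the minimum Hadoop requires (Java 6).
java_minor() {
    echo "$1" | cut -d. -f2
}

required=6
version="1.6.0_26"   # on a real machine: java -version 2>&1 | awk -F '"' '/version/ {print $2}'
minor=$(java_minor "$version")
if [ "$minor" -ge "$required" ]; then
    echo "Java OK (1.$minor)"
else
    echo "Java too old: need 1.$required or later" >&2
fi
```

The same check works for any 1.x-era JDK string, whether from OpenJDK or the Oracle JDK.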
Documentation and credits:

- Hadoop: The Definitive Guide, Tom White
- Apache Hadoop Quick Start: http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
- Apache Hadoop Cluster Setup: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
- Cloudera: https://ccp.cloudera.com/display/cdhdoc/cdh3+installation#cdh3installation-InstallingCDH3onUbuntuSystems
- Running Hadoop on Windows with Eclipse (again, I had trouble getting this running due to Cygwin OpenSSH): http://ebiquity.umbc.edu/tutorials/hadoop/00%20-%20intro.html
- Sample instructions for installing Cygwin OpenSSH on Windows: http://chinese-watercolor.com/lrp/printsrv/cygwin-sshd.html

Setting up Hadoop on Ubuntu using the Cloudera distribution:

1) Putty into an Ubuntu instance.

2) Add a software repository by creating a new file:

   $ sudo touch /etc/apt/sources.list.d/cloudera.list
   $ sudo vi /etc/apt/sources.list.d/cloudera.list

   Add the following contents to the file:

   deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
   deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib

   where <RELEASE> is the codename of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 on Ubuntu lucid, use lucid-cdh3 in the lines above:

   deb http://archive.cloudera.com/debian lucid-cdh3 contrib
   deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib

   Verify you have updated this file properly:

   $ sudo cat /etc/apt/sources.list.d/cloudera.list

3) (Optional) You can add a repository key that enables you to verify that you are downloading genuine packages. Add the Cloudera public GPG key to your keyring by piping it into apt-key so you are not prompted:

   $ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

4) Update the APT package index:

   $ sudo apt-get update

5) Find and install the Hadoop core package. (Note: this is different from Cloudera's instructions. I could not get their instructions to work!)

   $ sudo apt-cache search hadoop
   $ echo "deb http://archive.canonical.com/ lucid partner" | sudo tee /etc/apt/sources.list.d/partner.list
   $ sudo apt-get update
   $ sudo apt-get install hadoop-0.20

6) Make sure your JAVA_HOME environment variable is defined:

   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
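The repository lines in step 2 can also be generated from the machine's codename rather than edited by hand. A small sketch using the same archive URL as above (the `cloudera_sources` helper name is my own):

```shell
# Emit the two Cloudera CDH3 apt lines for a given Ubuntu codename.
cloudera_sources() {
    codename="$1"    # e.g. lucid, as reported by lsb_release -c
    echo "deb http://archive.cloudera.com/debian ${codename}-cdh3 contrib"
    echo "deb-src http://archive.cloudera.com/debian ${codename}-cdh3 contrib"
}

# On a real machine you would run: cloudera_sources "$(lsb_release -cs)"
cloudera_sources lucid
```

The output could then be fed to sudo tee /etc/apt/sources.list.d/cloudera.list instead of editing the file in vi.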

7) Verify Hadoop is installed:

   $ hadoop version

8) Install ssh (required for pseudo-distributed and distributed modes):

   $ sudo apt-get install ssh

9) To run Hadoop in a particular mode, you need to do two things: set the appropriate properties, and start the Hadoop daemons.

Standalone Operation

By default, Hadoop is configured to run in non-distributed mode as a single Java process. As explained earlier, this is useful for debugging. You do not have to create an HDFS file system for standalone mode.

The following copies an unpacked conf directory to use as input, then finds and displays every match of the given regular expression. Output is written to the given output directory.

$ mkdir input
$ cp /usr/lib/hadoop-0.20/conf/*.xml input

Verify:

$ ls input

Run one of the Hadoop examples. In the example below, I am running the MapReduce version of grep, which counts the matches of a regex in the input.

$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar grep input output 'dfs[a-z.]+'
11/05/05 19:44:39 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/05/05 19:44:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/05/05 19:44:39 INFO mapred.FileInputFormat: Total input paths to process : 7
11/05/05 19:44:40 INFO mapred.JobClient: Running job: job_local_0001

You can list the other Hadoop examples by leaving off the arguments:

$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar

Run and document at least one other Hadoop example.
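To see what the grep example is computing, the same result can be approximated with ordinary Unix tools (my own sketch, not part of the lab): extract every match of the regex from the input files (the map step), then count occurrences of each distinct match (the reduce step).

```shell
# Shell approximation of the Hadoop grep example:
# map = extract each regex match, reduce = count per distinct match.
mkdir -p demo-input
printf 'dfs.replication\ndfs.name.dir something\nno match here\n' > demo-input/sample.txt

# -o prints each match on its own line; -h suppresses filenames.
grep -ohE 'dfs[a-z.]+' demo-input/*.txt | sort | uniq -c | sort -rn
```

The Hadoop job produces the same kind of count-per-match output, just written to files in the output directory.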

Extra Credit: Pseudo-Distributed Operation

Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

Configuration

Use the following /etc/hadoop-0.20/conf/core-site.xml. (This needs to be in the conf directory of the version of Hadoop you are running. If it is not, use --config on the hadoop command line to specify an alternative location.)

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Setup passphraseless ssh

Check that you can ssh to localhost without a passphrase (type exit to exit):

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands (as your own user, not root, so the keys end up owned by you):

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
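The configuration file can also be written from a script rather than edited by hand. A minimal sketch with the three pseudo-distributed properties, giving fs.default.name an hdfs:// scheme as the r0.20.2 quickstart does (the `write_hadoop_conf` helper name is my own):

```shell
# Write the pseudo-distributed core-site.xml to the given path.
write_hadoop_conf() {
    cat > "$1" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}

write_hadoop_conf ./core-site.xml   # real target: /etc/hadoop-0.20/conf/core-site.xml
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the XML.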

Execution

Change file permissions:

$ sudo chmod -R 777 /usr/lib/hadoop-0.20/

Change file ownership:

$ sudo chown -R hdfs /usr/lib/hadoop-0.20
$ sudo chown -R hdfs /var/log/hadoop-0.20

Format a new distributed filesystem:

$ hadoop namenode -format

Set JAVA_HOME and the various USER variables for Hadoop. Add the following lines to /usr/lib/hadoop-0.20/conf/hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_NAMENODE_USER=hdfs
export HADOOP_DATANODE_USER=hdfs
export HADOOP_SECONDARYNAMENODE_USER=hdfs
export HADOOP_JOBTRACKER_USER=hdfs
export HADOOP_TASKTRACKER_USER=hdfs

To use the hdfs user, a password needs to be set:

$ sudo passwd hdfs

Start the Hadoop daemons:

$ sudo /usr/lib/hadoop-0.20/bin/start-all.sh

The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Hadoop commands need to be run as the hdfs user. To switch to the hdfs user:

$ su hdfs

To access the web interfaces, open ports 50030, 50070, and 50075 on the AWS console. After all of this, I finally got everything to work correctly!

Browse the web interfaces for the NameNode and the JobTracker; by default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Copy the input files into the distributed filesystem:

$ bin/hadoop dfs -put conf input
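When a daemon fails to start, the first place to look is its log. The log-directory default described above (${HADOOP_LOG_DIR}, falling back to ${HADOOP_HOME}/logs) can be resolved with a small helper (my own sketch; the function name is hypothetical):

```shell
# Resolve where the Hadoop daemon logs go: HADOOP_LOG_DIR if set,
# otherwise $HADOOP_HOME/logs, mirroring the default described above.
hadoop_log_dir() {
    if [ -n "$HADOOP_LOG_DIR" ]; then
        echo "$HADOOP_LOG_DIR"
    else
        echo "${HADOOP_HOME:-/usr/lib/hadoop-0.20}/logs"
    fi
}

HADOOP_HOME=/usr/lib/hadoop-0.20
unset HADOOP_LOG_DIR
hadoop_log_dir
```

You could then tail the NameNode log with, e.g., tail -f "$(hadoop_log_dir)"/*namenode*.log while debugging startup problems.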

Run some of the examples provided:

$ bin/hadoop jar hadoop-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files. Either copy them from the distributed filesystem to the local filesystem and examine them there:

$ bin/hadoop dfs -get output output
$ cat output/*

or view them directly on the distributed filesystem:

$ bin/hadoop dfs -cat output/*

When you're done, stop the daemons with:

$ bin/stop-all.sh

Submission

Submit a short lab report on Blackboard demonstrating your results and providing feedback on the lab.

Thanks,
Jay Urbain