Using BAC Hadoop Cluster




Bodhisatta Barman Roy
January 16, 2015

Contents

1 Introduction
2 Daemon locations
3 Pre-requisites
4 Setting up
    4.1 Using a Linux Virtual Machine
    4.2 Ensuring access to cluster
    4.3 Setting up the necessary software/tools on your laptop
5 Writing your first Map/Reduce program: WordCount
    5.1 Fiddling with HDFS
    5.2 Writing a Map/Reduce program: WordCount
6 Other links

1 Introduction

The BAC cluster has been set up with Apache Hadoop 1.2.1. Please contact Prof. Tan Kian-Lee to seek approval to use MRv2.

The cluster has 7 nodes, bacn[0-6].comp.nus.edu.sg. bacn[0-5] run CentOS 6.5, while bacn6 runs CentOS 7. Each node has 40 GB of RAM, two 6-core 2.1 GHz Intel Xeon E5-2620 processors, and 4x1 TB (Terabyte) 7200 RPM SATA drives. The total usable space in the Hadoop Distributed File System (HDFS) is roughly 24 TB (see http://bacn0.comp.nus.edu.sg:50070).

To give you an idea of how much data 24 TB can hold: the string "caffeine" uses 8 bytes (8 B), 1 byte per character, so it could be stored in the BAC HDFS about 3 trillion times (24 TB / 8 B = 24 x 10^12 B / 8 B = 3 x 10^12). For comparison, the Oxford English Dictionary has about 350 million printed words (http://public.oed.com/history-of-the-oed/dictionary-facts/).
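The back-of-the-envelope figure above is easy to verify; a quick check of the arithmetic (using decimal units, 1 TB = 10^12 B):

```java
public class CapacityCheck {
    public static void main(String[] args) {
        long totalBytes = 24L * 1_000_000_000_000L; // 24 TB, with 1 TB = 10^12 B
        long wordBytes = 8;                         // "caffeine": 8 characters, 1 byte each
        long copies = totalBytes / wordBytes;
        System.out.println(copies);                 // prints 3000000000000, i.e. 3 x 10^12
    }
}
```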

2 Daemon locations

NameNode (NN), Secondary NameNode (SNN), JobTracker (JT): bacn0.comp.nus.edu.sg
DataNode (DN), TaskTracker (TT): bacn[0-6].comp.nus.edu.sg

A UNIX user named hadoop runs all the daemons necessary for executing Map/Reduce programs on the cluster.

3 Pre-requisites

The cluster has been tested with the following:

1. Eclipse Kepler: Eclipse IDE for Java Developers (http://eclipse.org/downloads/packages/release/kepler/sr2)
2. Apache Hadoop 1.2.1 (https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/). You may download the archive named hadoop-1.2.1.tar.gz.
3. Eclipse HDFS plugin for MapReduce (https://spideroak.com/share/OB2WE3DJMNPXG5DVMZTA/so_stuff/home/jacob/SpiderOak%20Hive/public_stuff/hadoop-eclipse-plugin-1.2.1.jar)
4. Mac OS X or Ubuntu 14.04 x86-64.
5. A SoC UNIXID. If you have a SoC UNIXID, you should have received a corresponding email address like <UNIXID>@comp.nus.edu.sg.
6. Access to HDFS: drop an email to Bodhi (bodhi@comp.nus.edu.sg) with your UNIXID requesting access.

4 Setting up

This section explains how to set up the software for writing MRv1 programs.

NOTE: Please ensure that you are within the SoC wireless/wired network. This means being physically in:

- Computing 1
- Computing 2, or
- ICube level 3.

Otherwise, you have to connect to the SoC WebVPN network. The documentation for setting up the VPN is available at https://docs.comp.nus.edu.sg/node/5065 and is outside the scope of this tutorial.

4.1 Using a Linux Virtual Machine

The Hadoop cluster assumes that you are using your SoC UNIX ID to access the cluster. To ensure this, it is advisable to use a Linux virtual machine, taking care that your userid in the virtual machine is the same as your SoC UNIX ID.

You can use either VMware Player or VirtualBox to set up a Linux virtual machine, and you can choose any Linux distro. The following links provide instructions for installing Ubuntu, a popular Linux distribution:

- http://www.wikihow.com/install-ubuntu-on-virtualbox
- http://wiki.opencog.org/w/setting_up_ubuntu_in_vmware_for_noobs

Please send your UNIX ID to Bodhi <bodhi@comp.nus.edu.sg> so that you can be granted access to the cluster.

4.2 Ensuring access to cluster

1. Mac and *nix users can open Terminal (or a similar application) and type ssh <UNIXID>@bacn0.comp.nus.edu.sg. Windows users can download an SSH client like PuTTY from www.putty.org (something to lose sleep over: http://bit.ly/1xuaaq5). In the Host Name text field, type bacn0.comp.nus.edu.sg and click Open.
2. Accept the SSH key fingerprint by typing yes or clicking Yes, as applicable.
3. For Mac/*nix users, type your <UNIXID> password at the prompt and hit Enter. For Windows users, at the "login as:" prompt, type your UNIXID and hit Enter; the password is your <UNIXID> password.

On successful login, you should see a prompt like [<UNIXID>@bacn0 ]$. You can log out by closing the respective application.

4.3 Setting up the necessary software/tools on your laptop

1. Extract the Eclipse and Apache Hadoop archives onto your laptop. Note: if you are on Windows, you might consider downloading 7-Zip, a free utility that can extract .tar.gz archives.
2. Copy the Eclipse plugin into the <Eclipse-directory>/plugins directory.
3. Start the Eclipse IDE by running the <Eclipse-directory>/eclipse executable.
4. Click Window → Open Perspective → Other...
5. Click on Map/Reduce and then press OK.
6. On the bottom half of the screen, click on the Map/Reduce Locations tab.
7. Click on the blue elephant icon on the bottom right (on hover, it says New Hadoop Location).
8. Fill in the following entries:
   (a) Location Name: any name of your choice, e.g. BAC Cluster.
   (b) Map/Reduce Master: host bacn0.comp.nus.edu.sg, port 9001. This is the JobTracker address.
   (c) DFS Master: host bacn0.comp.nus.edu.sg, port 9000. This is the HDFS address.
   (d) User name: your SoC UNIXID.
   (e) Click on Finish.

You should now be able to see the directories in HDFS by expanding the DFS Locations tree.
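The two "Master" entries in step 8 are the same endpoints a command-line Hadoop 1.x client would use. As a point of reference (not something you need to configure for this Eclipse-based setup), the equivalent client-side configuration would look roughly like this, assuming the standard core-site.xml and mapred-site.xml files:

```xml
<!-- core-site.xml: the DFS Master (HDFS endpoint) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://bacn0.comp.nus.edu.sg:9000</value>
</property>

<!-- mapred-site.xml: the Map/Reduce Master (JobTracker endpoint) -->
<property>
  <name>mapred.job.tracker</name>
  <value>bacn0.comp.nus.edu.sg:9001</value>
</property>
```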

5 Writing your first Map/Reduce program: WordCount

This section describes how to write your very first Map/Reduce program on the cluster. The objective is to count the number of occurrences of each word in a file. We shall download a file from the internet, upload it to HDFS, write the WordCount program, and run it on the cluster. The file in question is Shakespeare's Hamlet, available at http://www.gutenberg.org/ebooks/2265.txt.utf-8. Please note that the file is in plain text; marked-up formats like ePub, Mobi etc. are out of scope for this tutorial.

5.1 Fiddling with HDFS

1. Please download Shakespeare's Hamlet from the aforementioned link. Save it as hamlet.txt.
2. From Eclipse, expand the DFS location down to your user directory. It should be /user/<UNIXID>.
3. Right-click on the directory, select Create new directory, provide the name input and click OK.
4. Right-click on the directory and select Refresh. You should now see the directory input.
5. Right-click on input and select Upload files to DFS... Navigate to the directory where hamlet.txt was saved and select it. Right-click on input and select Refresh. On expanding input, the file hamlet.txt can be seen.

5.2 Writing a Map/Reduce program: WordCount

1. In Eclipse, create a new Map/Reduce project (from the menu bar on top) by clicking File → New → Other..., selecting Map/Reduce project, and then clicking Next.
2. Type WordCount as the name of the project and then click on Finish. The project WordCount can now be seen in the Project Explorer on the left. On expanding WordCount, it can be observed that all the JARs necessary for writing a Map/Reduce program have been added to the project automatically.
3. Right-click the directory named src and select New → Class. Type WordCount in the text field marked Name and click on Finish.
4. Remove the contents of the file WordCount.java.

5. Copy the source code from http://pastie.org/9799572 and paste it inside WordCount.java.
6. From the menu bar on top, click on Run and then Run Configurations...
7. Click the tab named Arguments, and in the text field named Program Arguments type hdfs://bacn0.comp.nus.edu.sg:9000/user/<UNIXID>/input/hamlet.txt hdfs://bacn0.comp.nus.edu.sg:9000/user/<UNIXID>/output. Please note that the two arguments are separated by a space character: the first indicates the location of the input file, the second specifies the output directory where the results of the program will be stored. Click on Apply and then Close.
8. In the Project Explorer, right-click on WordCount.java, go to Run as... and select 2. Run on Hadoop. The program should now start compiling; a red square in the Console tab becomes active, and in a few moments some text appears in the Console. What happens in the background is this:
   (a) The project is compiled and packaged into a single JAR called WordCount.jar, which is copied to the target cluster (the BAC cluster in our case) and executed there.
   (b) The output of the program is stored in HDFS inside /user/<UNIXID>/output.
9. Please wait until the red square in the Console tab becomes inactive (turns grey).
10. Refresh the HDFS view by right-clicking on /user/<UNIXID> and selecting Refresh. A new directory named output can be observed.
11. On expanding output, two new files can be seen: _SUCCESS and part-r-00000.
12. The result of the program can be seen in part-r-00000. In this case, the result is the number of occurrences of every word in the file hamlet.txt.

NOTE: If you wish to run this program again, ensure that the output directory doesn't already exist in HDFS; otherwise, Hadoop will throw an error. To delete it, right-click on output, select Delete, and click OK.
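The source at the link above is the standard Hadoop WordCount: a mapper that emits a (word, 1) pair for every token in its input split, and a reducer that sums those 1s for each word. Stripped of the Hadoop job plumbing, the underlying computation is just tokenise-and-sum; here is a plain-Java sketch of that core (the class name and sample input are illustrative, not part of the tutorial):

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of what the WordCount job computes. In the real
// Map/Reduce program the mapper emits a (word, 1) pair per token and the
// reducer sums the 1s for each word; here both phases are folded into a
// single in-memory pass for illustration.
public class WordCountSketch {

    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(text); // mapper: split on whitespace
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            counts.merge(word, 1, Integer::sum);            // reducer: sum per word
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("to be or not to be");
        System.out.println(counts); // prints {be=2, not=1, or=1, to=2}
    }
}
```

Hadoop parallelises exactly this computation: each mapper tokenises one chunk of hamlet.txt, and the framework groups the emitted pairs by word so that each reducer invocation performs the summing step for a single word.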

6 Other links

The following links show the status of HDFS and Map/Reduce in the BAC cluster:

1. Hadoop M/R JobTracker status: http://bacn0.comp.nus.edu.sg:50030
2. Hadoop M/R TaskTracker status: http://bacn0.comp.nus.edu.sg:50060
3. HDFS status: http://bacn0.comp.nus.edu.sg:50070