The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC)
Hadoop Implementation on Riptide

Table of Contents

Executive Summary
Getting Started
    Hadoop
        Introduction
        Installation
    myhadoop
        Introduction
        Installation
        Configuration
Testing

Executive Summary

The Maui High Performance Computing Center is a Department of Defense (DoD) Supercomputing Resource Center (DSRC) and one of the five DSRCs in the DoD's High Performance Computing Modernization Program (HPCMP). It is a US Air Force Research Laboratory (AFRL) Center and part of AFRL's Directed Energy Directorate at Kirtland Air Force Base, New Mexico. The Center's infrastructure is connected to the Defense Research and Engineering Network (DREN) and other proprietary networks, and it provides a large-scale parallel computing environment with terabytes of disk and online tape storage and high-speed communications.

One of the HPC systems at MHPCC is Riptide, an IBM iDataPlex. The login and compute nodes are populated with two Intel Sandy Bridge 8-core processors. Riptide uses an FDR 10 InfiniBand interconnect and IBM's General Parallel File System (GPFS). Riptide has 756 compute nodes with direct water cooling; memory is not shared across the nodes. Each diskless compute node has two 8-core processors (16 cores) running its own Red Hat Enterprise Linux OS and sharing 32 GBytes of memory. Riptide is rated at 251.6 peak TFLOPS.

The use of Hadoop on a large-scale computing platform came about as part of the research and development thrust that the Center provides. What is Hadoop? Hadoop is open-source software for distributed computing on large clusters. What separates it from other distributed computing software is that it allows the distributed processing of large data sets across clusters of commodity computers using simple programming models, and it can scale from a single server to thousands of machines, each offering local computation and storage. Since Hadoop is designed to work with local storage, the objective of this paper is to demonstrate that even on an infrastructure like Riptide, with its diskless compute nodes, Hadoop can still be installed and used as it was originally designed. This is where the myhadoop software comes in. myhadoop is a set of scripts that allows a Hadoop installation to work within a high performance computing environment like Riptide. myhadoop was developed at the San Diego Supercomputer Center at the University of California, San Diego, by Dr. Sriram Krishnan. Further details on myhadoop can be found at http://myhadoop.sourceforge.net/.

Getting Started

Hadoop

Introduction

Since Hadoop is a framework that runs on large clusters, it needs a mechanism to distribute data across the compute nodes in the cluster and to store files within its framework. This is provided by its two main components: Map/Reduce and the Hadoop Distributed File System (HDFS). Map/Reduce is Hadoop's implementation of dividing an application into many small pieces of work that can be executed on any of the cluster's compute nodes. HDFS, on the other hand, is where data is stored. Hadoop stores data in files and does not index them, which is why Hadoop is not really a direct substitute for a database: finding something in Hadoop usually means running a Map/Reduce job. The upside is that Hadoop excels in areas where data retrieval or search becomes too overwhelming for a regular database query. Furthermore, Hadoop, with its Map/Reduce component, can process very large unstructured and non-related data sets.
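To make the Map/Reduce model concrete, the sketch below shows a word count written as two small shell scripts and driven through Hadoop Streaming, which lets ordinary executables serve as the map and reduce steps. This is only an illustrative sketch and not part of the Riptide configuration described later; the streaming jar location varies by Hadoop release (in the 1.x series it typically sits under $HADOOP_HOME/contrib/streaming), and the input/output paths are placeholders.

    mapper.sh:
        #!/bin/bash
        # Read raw text from standard input and emit one "word<TAB>1" record per word.
        tr -s '[:space:]' '\n' | grep -v '^$' | awk '{ print $0 "\t1" }'

    reducer.sh:
        #!/bin/bash
        # Sum the counts for each distinct word seen in the map output.
        awk -F '\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }'

    Submitting the streaming job (jar and HDFS paths are placeholders):
        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
            -input rundata/davinci.txt -output rundata/streaming-out \
            -mapper mapper.sh -reducer reducer.sh \
            -file mapper.sh -file reducer.sh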

Installation

A valid Hadoop installation is the first prerequisite to getting the myhadoop setup on Riptide to work. Hadoop version 0.20.2 is the recommended version for myhadoop; however, a later stable version can be used as well, provided that the environment variables, the corresponding file system paths, and the jars referenced within the scripts all point to the installed version. Hadoop installers can be found at http://hadoop.apache.org/releases.html#download. At the time of this writing, the current stable release is version 1.2. Extract the downloaded compressed file into the folder of your choice. A typical Hadoop installation has the following files/folders:

Figure 1: Typical Hadoop installation

On Riptide, this folder is located at /gpfs/pkgs/mhpcc/hadoop-<version>. In the scripts, it is referred to as HADOOP_HOME.
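As a concrete sketch of the download and extraction step (the mirror URL, release number, and target directory below are illustrative only; substitute the release actually chosen and a folder you can write to):

    wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
    tar -xzf hadoop-1.2.1.tar.gz -C /path/of/your/choice
    export HADOOP_HOME=/path/of/your/choice/hadoop-1.2.1
    ls $HADOOP_HOME        # bin, conf, lib, and the bundled example jars should be visible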

myhadoop

Introduction

As mentioned above, myhadoop is a set of scripts that enables a user to submit Hadoop jobs to an HPC system like Riptide via PBS. In typical HPC clusters, jobs are submitted in batch through a resource management system such as PBS. Because Hadoop provides its own scheduling and manages its own job submissions, running a Hadoop implementation in an HPC environment may seem redundant and is sometimes considered a challenge. It is for this purpose that myhadoop was born.

Performance-wise, Hadoop is known to scale on clusters of up to 4,000 nodes, and sort performance is good at 900 nodes: according to wiki.apache.org, sorting 9 TB of data on 900 nodes takes around 1.8 hours.

Installation

The installer can be downloaded from http://myhadoop.sourceforge.net. Unzip the downloaded compressed file into the folder of your choice. Once the files have been extracted, check the preliminary details under the docs folder on how to get started and which scripts within myhadoop need modifying. A typical myhadoop installation has the following file structure:

    bin/
        pbs-cleanup.sh
        pbs-configure.sh
        setenv.sh
        sge-cleanup.sh
        sge-configure.sh
    CHANGELOG
    docs/
        myhadoop.png
    etc/
        core-site.xml
        hdfs-site.xml
        mapred-site.xml
    LICENSE
    pbs-example.sh
    README
    RELEASE_NOTES
    sge-example.sh

Figure 2: Typical myhadoop installation

Job submissions in myhadoop will not succeed without a valid Hadoop installation. Hadoop 0.20.2 is the recommended version, but any later stable version is also applicable provided that the appropriate scripts within myhadoop are updated accordingly. Once a valid Hadoop installation is in place, ensure that the scripts listed in the Configuration section below are modified to fit the target system/cluster. On Riptide, this folder is installed at /gpfs/pkgs/mhpcc/myhadoop-<version>. In the scripts, it is referred to as MY_HADOOP_HOME.
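A rough sketch of that setup, paralleling the Hadoop installation above (the archive name and install path are illustrative, and the extraction command should match the archive format actually downloaded; 0.2a is the release referenced by the Riptide scripts):

    # Download the myhadoop release archive from http://myhadoop.sourceforge.net, then:
    tar -xzf myHadoop-0.2a.tar.gz -C /path/of/your/choice
    export MY_HADOOP_HOME=/path/of/your/choice/myHadoop-0.2a
    cat $MY_HADOOP_HOME/README                      # getting-started notes
    ls $MY_HADOOP_HOME/bin $MY_HADOOP_HOME/etc      # helper scripts and template config files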

Configuration

1. setenv.sh

Please refer to Figure 2 (typical myhadoop installation). This file is located under the bin folder and is called from pbs-hadoop.sh. It contains the relevant environment variables for setting up and running Hadoop. On Riptide it lives under the /gpfs/pkgs/mhpcc/hadoop/myhadoop-<version>/bin folder and should be updated with the correct values; otherwise, Hadoop will not run or will error out. The following are the current settings, which may or may not need tweaking:

    a. export MY_HADOOP_HOME="/gpfs/pkgs/mhpcc/hadoop/myHadoop-0.2a"
    b. export HADOOP_HOME="/gpfs/pkgs/mhpcc/hadoop/hadoop-0.20.2"
    c. export HADOOP_DATA_DIR=${WORKDIR}/hadoop/data
    d. export HADOOP_LOG_DIR=${WORKDIR}/hadoop/log

2. pbs-hadoop.sh

This script can be considered the main script for executing Hadoop jobs via PBS. It is an entirely new file created from the pbs-example.sh provided by myhadoop, and it lives under the /gpfs/pkgs/mhpcc/hadoop/myhadoop-<version> folder. This is the script that is submitted via PBS, so be sure to edit it for Hadoop testing. The script contains sample Hadoop jobs that use text files as source data. The results of the Hadoop word count are copied to $WORKDIR/results.$PBS_JOBID; if the Hadoop job is successful, a file "part-r-00000" containing the word-count results will be written under that folder. Be aware that anything placed under $WORKDIR is purged at a later point, so copy the results to a more persistent location to avoid losing any data. The script also contains the pertinent PBS directives that configure the PBS job when it is invoked. To run it via PBS, simply issue the following command:

    qsub pbs_hadoop.sh

This assumes that the script is invoked from the MY_HADOOP_HOME root folder; otherwise, make the necessary path correction on the command line.

SCRIPT CONTENTS/DETAILS

a. PBS Directives

The following PBS directives are needed to configure a PBS job:

    #PBS -q standard
    #PBS -N hadoop_job
    #PBS -l walltime=00:03:00
    #PBS -l select=2:ncpus=1
    #PBS -l place=scatter
    #PBS -o <path>/hadoop_run.out
    #PBS -j oe
    #PBS -A <account>
    #PBS -V

Please refer to the PBS documentation at http://www.pbsworks.com/supportdocuments.aspx for a detailed explanation of each directive, but here are the basics:

    -q is the queue where the PBS job is supposed to run
    -N is the name of the PBS job
    -l walltime is the hours:minutes:seconds allowed for the entire job to complete its run
    -l select is the number of nodes, while ncpus is the number of CPUs per node
    -o is the path and name of the log file
    -j oe tells PBS to combine the output and error logs; in this instance, into the output log
    -A is the account the job is supposed to run under
    -V exports the submitting shell's environment variables to the job
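For example, to run the same job across four nodes with a one-hour limit, only the walltime and select directives need to change, keeping the one-chunk-per-node pattern the script already uses (queue, account, and output path remain the site-specific placeholders shown above); the node count must then also match the value passed to pbs-configure.sh in step d below:

    #PBS -l walltime=01:00:00
    #PBS -l select=4:ncpus=1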

b. Export the HADOOP_CONF_DIR variable

This is the base path for all the necessary Hadoop configuration files. The files under this directory are usually the same files found in the $HADOOP_HOME/conf folder, with a slight difference: mapred-site.xml, core-site.xml, hdfs-site.xml, and hadoop-env.sh are taken from $MY_HADOOP_HOME/etc. Modify the files under $MY_HADOOP_HOME/etc only when really necessary; otherwise, the settings should suffice for a simple Hadoop job.

c. Set the environment variables

At this point, the script calls setenv.sh, which in turn sets the pertinent environment variables listed in item 1 above.

d. Set up the cluster configuration

At this stage, the script calls pbs-configure.sh, which expects the following parameters: the total number of nodes needed and the $HADOOP_CONF_DIR.

e. Format HDFS

Once all the necessary configurations are set up, format the HDFS by issuing the following:

    $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR namenode -format

Since a persistent Hadoop cluster is not the objective of this study, the HDFS is initialized/formatted every time a PBS-Hadoop job is submitted.

f. Run Hadoop jobs

Running Hadoop means starting all the necessary daemons, which is accomplished by executing:

    $HADOOP_HOME/bin/start-all.sh

Once all the necessary daemons are started, it is time to run some Hadoop jobs, such as the following:

    echo "Run some test Hadoop jobs"
    $HADOOP_HOME/bin/hadoop version
    echo "Creating rundata directory..."
    $HADOOP_HOME/bin/hadoop dfs -mkdir rundata
    echo "Copying file from local to HDFS..."
    $HADOOP_HOME/bin/hadoop dfs -copyFromLocal $HOME/davinci.txt rundata
    echo "Checking file copied (HDFS)..."
    $HADOOP_HOME/bin/hadoop dfs -ls rundata/davinci.txt
    echo "Running wordcount on the file inside HDFS..."
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount rundata/davinci.txt rundata/outputs
    echo "Checking the wordcount output under outputs..."
    $HADOOP_HOME/bin/hadoop dfs -ls rundata/outputs
    echo "Copying results from HDFS to local FS..."
    $HADOOP_HOME/bin/hadoop dfs -copyToLocal rundata/outputs $WORKDIR/results.$PBS_JOBID
    echo "Sample Hadoop job finished!"

g. Clean up

This part removes the HADOOP_LOG_DIR and HADOOP_DATA_DIR used in the run. If the results of the Hadoop run are needed for later analysis, make sure to copy the files out of HDFS before the run finishes. (A condensed sketch showing how steps a through g fit together in a single job script is given after this list.)
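The listing below is a condensed sketch of how steps a through g might be combined into a single PBS job script. It is not a copy of the pbs-hadoop.sh installed on Riptide: the HADOOP_CONF_DIR location is a placeholder, the exact options accepted by pbs-configure.sh and pbs-cleanup.sh should be checked against the scripts shipped with the myhadoop release in use, and the stop-all.sh call is added here to shut the daemons down before cleanup.

    #!/bin/bash
    #PBS -q standard
    #PBS -N hadoop_job
    #PBS -l walltime=00:03:00
    #PBS -l select=2:ncpus=1
    #PBS -l place=scatter
    #PBS -o <path>/hadoop_run.out
    #PBS -j oe
    #PBS -A <account>
    #PBS -V

    # (b) Per-job Hadoop configuration directory (placeholder path; $WORKDIR comes
    #     from the login environment via -V).
    export HADOOP_CONF_DIR=$WORKDIR/hadoop-conf.$PBS_JOBID

    # (c) Set the environment variables. MY_HADOOP_HOME is set explicitly here (the
    #     value matches setenv.sh item a above); setenv.sh then supplies HADOOP_HOME,
    #     HADOOP_DATA_DIR, and HADOOP_LOG_DIR.
    export MY_HADOOP_HOME=/gpfs/pkgs/mhpcc/hadoop/myHadoop-0.2a
    . $MY_HADOOP_HOME/bin/setenv.sh

    # (d) Generate the cluster configuration for the nodes PBS allocated; the script
    #     expects the node count and the configuration directory (option names may differ).
    $MY_HADOOP_HOME/bin/pbs-configure.sh -n 2 -c $HADOOP_CONF_DIR

    # (e) Format HDFS; the cluster is not persistent, so this is done on every run.
    $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR namenode -format

    # (f) Start the daemons, run the sample word count, and copy the results out
    #     (the examples jar name must match the installed Hadoop version).
    $HADOOP_HOME/bin/start-all.sh
    $HADOOP_HOME/bin/hadoop dfs -mkdir rundata
    $HADOOP_HOME/bin/hadoop dfs -copyFromLocal $HOME/davinci.txt rundata
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount rundata/davinci.txt rundata/outputs
    $HADOOP_HOME/bin/hadoop dfs -copyToLocal rundata/outputs $WORKDIR/results.$PBS_JOBID
    $HADOOP_HOME/bin/stop-all.sh

    # (g) Clean up the per-run data and log directories.
    $MY_HADOOP_HOME/bin/pbs-cleanup.sh -n 2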

3. pbs-configure.sh

This file resides under myhadoop-<version>/bin. As its name implies, it sets up the necessary environment variables and file configuration for Hadoop to work.

4. hadoop-env.sh

This is the only file that needs editing in the Hadoop installation folder. It resides under the <hadoop installation>/conf folder. The only line required here is JAVA_HOME; set it to the correct JAVA_HOME and it is good to go.

Testing

Once all the necessary scripts have been edited and updated with the correct values, the framework is ready for testing. Run some Hadoop jobs and make sure to check the logs under HADOOP_LOG_DIR and the "hadoop_run.out" file under $WORKDIR.
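A quick way to exercise the framework and inspect the results is sketched below; the output path is a placeholder, and depending on how the copyToLocal step lays out its output, part-r-00000 may sit one directory deeper (for example under an outputs subfolder).

    qsub pbs_hadoop.sh                  # submit from the MY_HADOOP_HOME root folder
    qstat -u $USER                      # watch the job state until it completes

    # When the job has finished, inspect the combined PBS output/error log
    # (the <path> placeholder is whatever was set on the #PBS -o directive)...
    cat <path>/hadoop_run.out

    # ...and the word-count results copied back from HDFS.
    ls $WORKDIR/results.*
    head $WORKDIR/results.*/part-r-00000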