RDMA for Apache Hadoop 0.9.9 User Guide




HIGH-PERFORMANCE BIG DATA TEAM
http://hibd.cse.ohio-state.edu

NETWORK-BASED COMPUTING LABORATORY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THE OHIO STATE UNIVERSITY

Copyright (c) 2011-2014 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved.

Last revised: August 19, 2014

Contents

1 Overview of the RDMA for Apache Hadoop Project
2 Features
3 Installation Instructions
  3.1 Prerequisites
  3.2 Download
  3.3 Basic Configuration
  3.4 Advanced Configuration
4 Basic Usage Instructions
  4.1 Startup
  4.2 Basic Commands
  4.3 Shutdown
5 Benchmarks
  5.1 TestDFSIO
  5.2 Sort
  5.3 TeraSort

1 Overview of the RDMA for Apache Hadoop Project

RDMA for Apache Hadoop is a high-performance design of Hadoop over RDMA-enabled interconnects. This version of RDMA for Apache Hadoop, 0.9.9, is based on Apache Hadoop 1.2.1. This file is intended to guide users through the various steps involved in installing, configuring, and running RDMA for Apache Hadoop over InfiniBand. If there are any questions, comments, or feedback regarding this software package, please post them to the rdma-hadoop-discuss mailing list (rdma-hadoop-discuss@cse.ohio-state.edu).

2 Features

High-level features of RDMA for Apache Hadoop 0.9.9 are listed below.

- Based on Apache Hadoop 1.2.1
- High-performance design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components
- Compliant with Apache Hadoop 1.2.1 APIs and applications
- Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
- On-demand connection setup
- HDFS over native InfiniBand and RoCE
  - RDMA-based write
  - RDMA-based replication
  - Parallel replication support
- MapReduce over native InfiniBand and RoCE
  - RDMA-based shuffle
  - Prefetching and caching of map output
  - In-memory merge
  - Advanced optimization in overlapping map, shuffle, and merge; and shuffle, merge, and reduce
- RPC over native InfiniBand and RoCE
  - JVM-bypassed buffer management
  - RDMA or send/recv based adaptive communication
  - Intelligent buffer allocation and adjustment for serialization

- Tested with
  - Native verbs-level support with Mellanox InfiniBand adapters (DDR, QDR, and FDR)
  - RoCE support with Mellanox adapters
  - Various multi-core platforms
  - Different file systems with disks and SSDs

3 Installation Instructions

3.1 Prerequisites

Prior to the installation of RDMA for Apache Hadoop, please ensure that you have the latest version of the JDK installed on your system, and set the JAVA_HOME and PATH environment variables to point to the appropriate JDK installation. We recommend JDK version 1.7 or later. In order to use the RDMA-based features provided with RDMA for Apache Hadoop, install the latest version of the OFED distribution, which can be obtained from http://www.openfabrics.org.

3.2 Download

Download the most recent distribution tarball from http://hibd.cse.ohio-state.edu/download/hibd/rdma-hadoop-0.9.9-bin.tar.gz.

3.3 Basic Configuration

Most of the configuration files are in the directory rdma-hadoop-0.9.9/conf. Steps to configure RDMA for Apache Hadoop include:

1. Configure the hadoop-env.sh file.

export JAVA_HOME=/opt/java/1.7.0

2. Configure the core-site.xml file. RDMA for Apache Hadoop 0.9.9 supports three different modes: IB, RoCE, and TCP/IP.

Configuration of the IB mode:

<property>
  <name>fs.default.name</name>
  <value>hdfs://node001:9000</value>
  <description>URL of NameNode</description>
</property>

<property>
  <name>hadoop.ib.enabled</name>
  <value>true</value>
  <description>Enable the RDMA feature over IB.
  Default value of hadoop.ib.enabled is true.</description>
</property>

<property>
  <name>hadoop.roce.enabled</name>
  <value>false</value>
  <description>Disable the RDMA feature over RoCE.
  Default value of hadoop.roce.enabled is false.</description>
</property>

Configuration of the RoCE mode:

<property>
  <name>fs.default.name</name>
  <value>hdfs://node001:9000</value>
  <description>URL of NameNode</description>
</property>

<property>
  <name>hadoop.ib.enabled</name>
  <value>false</value>
  <description>Disable the RDMA feature over IB.
  Default value of hadoop.ib.enabled is true.</description>
</property>

<property>
  <name>hadoop.roce.enabled</name>
  <value>true</value>
  <description>Enable the RDMA feature over RoCE.
  Default value of hadoop.roce.enabled is false.</description>
</property>

Configuration of the TCP/IP mode:

<property>
  <name>fs.default.name</name>
  <value>hdfs://node001:9000</value>
  <description>URL of NameNode</description>
</property>

<property>
  <name>hadoop.ib.enabled</name>
  <value>false</value>
  <description>Disable the RDMA feature over IB.
  Default value of hadoop.ib.enabled is true.</description>
</property>

<property>
  <name>hadoop.roce.enabled</name>
  <value>false</value>
  <description>Disable the RDMA feature over RoCE.
  Default value of hadoop.roce.enabled is false.</description>
</property>

Note that hadoop.ib.enabled and hadoop.roce.enabled should not both be set to true at the same time.

3. Configure the hdfs-site.xml file.

<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/rdma-hadoop-0.9.9/hadoopname</value>
  <description>Path on the local filesystem where the NameNode
  stores the namespace and transaction logs.</description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/data01,/data02</value>
  <description>Comma-separated list of paths on the local filesystem
  of a DataNode where it should store its blocks.</description>
</property>

4. Configure the mapred-site.xml file.

<property>
  <name>mapred.job.tracker</name>
  <value>node001:9001</value>
  <description>Host or IP and port of the JobTracker.</description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system/</value>
  <description>Path on HDFS where the MapReduce framework
  stores system files.</description>
</property>

5. Configure the masters file. Write the master hostname in this file.

node001

6. Configure the slaves file. List all slave hostnames in this file.

node002
node003

We can also configure more specific items according to actual needs. For example, we can configure the item dfs.block.size in hdfs-site.xml to change the HDFS block size. For more detailed information, please visit http://hadoop.apache.org.

3.4 Advanced Configuration

Some advanced features in RDMA for Apache Hadoop 0.9.9 can be manually enabled by users. Steps to configure these features include:

1. Enable parallel replication in HDFS by configuring the hdfs-site.xml file. By default, RDMA for Apache Hadoop 0.9.9 chooses the pipeline replication mechanism.

<property>
  <name>dfs.replication.parallel</name>
  <value>true</value>
  <description>Enable the parallel replication feature in HDFS.
  Default value of dfs.replication.parallel is false.</description>
</property>

2. Enable disk-based shuffle in MapReduce by configuring the mapred-site.xml file. By default, RDMA for Apache Hadoop 0.9.9 assumes that disk-based shuffle is enabled. We encourage users to disable this parameter if the local disk performance in their cluster is not good enough.

<property>
  <name>mapred.disk.shuffle.enabled</name>
  <value>true</value>
  <description>Enable disk-based shuffle in MapReduce.
  Default value of mapred.disk.shuffle.enabled is true.</description>
</property>
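The three modes of section 3.3 differ only in the values of hadoop.ib.enabled and hadoop.roce.enabled. As a quick sanity check, the configured mode can be read back from core-site.xml with a small script such as the following sketch (the helper names and the assumption that each <value> sits on the line after its <name> are illustrative, not part of the distribution):

```shell
# Print the <value> on the line following a property's <name> line.
get_value() {   # get_value <property-name> <core-site.xml>
    grep -A1 "<name>$1</name>" "$2" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Report which transport mode a core-site.xml selects.
detect_mode() { # detect_mode <path-to-core-site.xml>
    ib=$(get_value hadoop.ib.enabled "$1")
    roce=$(get_value hadoop.roce.enabled "$1")
    if [ "$ib" = "true" ] && [ "$roce" = "true" ]; then
        echo "invalid: hadoop.ib.enabled and hadoop.roce.enabled are both true"
    elif [ "$ib" = "true" ]; then
        echo "mode: IB"
    elif [ "$roce" = "true" ]; then
        echo "mode: RoCE"
    else
        echo "mode: TCP/IP"
    fi
}

# Example: detect_mode rdma-hadoop-0.9.9/conf/core-site.xml
```

Note that, matching the defaults above, a core-site.xml that sets neither property is reported as IB mode by the actual software (hadoop.ib.enabled defaults to true); this sketch only inspects explicitly written values.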

4 Basic Usage Instructions

RDMA for Apache Hadoop 0.9.9 has management operations similar to default Apache Hadoop 1.2.1. This section lists several of them for basic usage.

4.1 Startup

1. Use the following command to format the directory which stores the namespace and transaction logs for the NameNode:

$ bin/hadoop namenode -format

2. Start HDFS with the following command:

$ bin/start-dfs.sh

3. Start MapReduce with the following command:

$ bin/start-mapred.sh

4. Start HDFS+MapReduce with the following command:

$ bin/start-all.sh

4.2 Basic Commands

1. Use the following command to manage HDFS:

$ bin/hadoop dfsadmin

Usage: java DFSAdmin
    [-report]
    [-safemode enter | leave | get | wait]
    [-saveNamespace]
    [-refreshNodes]
    [-finalizeUpgrade]
    [-upgradeProgress status | details | force]
    [-metasave filename]
    [-refreshServiceAcl]
    [-refreshUserToGroupsMappings]
    [-refreshSuperUserGroupsConfiguration]
    [-setQuota <quota> <dirname>...<dirname>]
    [-clrQuota <dirname>...<dirname>]
    [-setSpaceQuota <quota> <dirname>...<dirname>]
    [-clrSpaceQuota <dirname>...<dirname>]
    [-setBalancerBandwidth <bandwidth in bytes per second>]
    [-help [cmd]]

Generic options supported are
    -conf <configuration file>                    specify an application configuration file
    -D <property=value>                           use value for given property
    -fs <local|namenode:port>                     specify a namenode
    -jt <local|jobtracker:port>                   specify a job tracker
    -files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

For example, we often use the following command to show the status of HDFS:

$ bin/hadoop dfsadmin -report

2. Use the following command to manage files in HDFS:

$ bin/hadoop fs

Usage: java FsShell
    [-ls <path>]
    [-lsr <path>]
    [-du <path>]
    [-dus <path>]
    [-count [-q] <path>]
    [-mv <src> <dst>]
    [-cp <src> <dst>]
    [-rm [-skipTrash] <path>]
    [-rmr [-skipTrash] <path>]
    [-expunge]
    [-put <localsrc> ... <dst>]
    [-copyFromLocal <localsrc> ... <dst>]
    [-moveFromLocal <localsrc> ... <dst>]
    [-get [-ignoreCrc] [-crc] <src> <localdst>]
    [-getmerge <src> <localdst> [addnl]]
    [-cat <src>]
    [-text <src>]
    [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
    [-moveToLocal [-crc] <src> <localdst>]
    [-mkdir <path>]
    [-setrep [-R] [-w] <rep> <path/file>]
    [-touchz <path>]
    [-test -[ezd] <path>]
    [-stat [format] <path>]
    [-tail [-f] <file>]

    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-chgrp [-R] GROUP PATH...]
    [-help [cmd]]

Generic options supported are
    -conf <configuration file>                    specify an application configuration file
    -D <property=value>                           use value for given property
    -fs <local|namenode:port>                     specify a namenode
    -jt <local|jobtracker:port>                   specify a job tracker
    -files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

For example, we can use the following command to list directory contents of HDFS:

$ bin/hadoop fs -ls /

3. Use the following command to interoperate with the MapReduce framework:

$ bin/hadoop job

Usage: JobClient <command> <args>
    [-submit <job-file>]
    [-status <job-id>]
    [-counter <job-id> <group-name> <counter-name>]
    [-kill <job-id>]
    [-set-priority <job-id> <priority>]
        Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
    [-events <job-id> <from-event-#> <#-of-events>]
    [-history <jobOutputDir>]
    [-list [all]]
    [-list-active-trackers]
    [-list-blacklisted-trackers]
    [-list-attempt-ids <job-id> <task-type> <task-state>]
    [-kill-task <task-id>]
    [-fail-task <task-id>]

Generic options supported are
    -conf <configuration file>                    specify an application configuration file
    -D <property=value>                           use value for given property
    -fs <local|namenode:port>                     specify a namenode

    -jt <local|jobtracker:port>                   specify a job tracker
    -files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

For example, we can use the following command to list all active trackers of MapReduce:

$ bin/hadoop job -list-active-trackers

4.3 Shutdown

1. Stop HDFS with the following command:

$ bin/stop-dfs.sh

2. Stop MapReduce with the following command:

$ bin/stop-mapred.sh

3. Stop HDFS+MapReduce with the following command:

$ bin/stop-all.sh

5 Benchmarks

5.1 TestDFSIO

The TestDFSIO benchmark is used to measure the I/O performance of HDFS. It does this by using a MapReduce job to read or write files in parallel. Each file is read or written in a separate map task, and the benchmark reports the average read/write throughput per map. On a client node, the TestDFSIO write experiment can be run using the following command:

$ bin/hadoop jar hadoop-test-*.jar TestDFSIO -write -nrFiles 2 -fileSize 1000

This command writes 2 files of 1,000 MB each.

5.2 Sort

The Sort benchmark uses the MapReduce framework to sort the input directory into the output directory. The inputs and outputs are sequence files where the keys and values are BytesWritable. Before running the
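Since TestDFSIO reports per-map throughput, each phase moves nrFiles x fileSize megabytes in total. A small wrapper sketch around the command above (the helper name and DRY_RUN convention are ours; -read is the standard TestDFSIO counterpart of -write in Hadoop 1.x):

```shell
# Run (or, with DRY_RUN=1, just print) a TestDFSIO write and read pass.
run_testdfsio() { # run_testdfsio <nrFiles> <fileSize-in-MB>
    nr=$1; size=$2
    # Each phase moves nrFiles * fileSize MB through HDFS in total.
    total=$((nr * size))
    echo "total data per phase: ${total} MB"
    for op in -write -read; do
        cmd="bin/hadoop jar hadoop-test-*.jar TestDFSIO $op -nrFiles $nr -fileSize $size"
        if [ "$DRY_RUN" = "1" ]; then echo "$cmd"; else $cmd; fi
    done
}

# Example: DRY_RUN=1; run_testdfsio 2 1000
```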

Sort benchmark, we can use RandomWriter to generate the input data. RandomWriter writes random data to HDFS using the MapReduce framework. Each map takes a single file name as input and writes random BytesWritable keys and values to the HDFS sequence file. On a client node, the RandomWriter experiment can be run using the following command:

$ bin/hadoop jar hadoop-examples-*.jar randomwriter <out-dir> [<configuration file>]

More details about configurations can be found at http://wiki.apache.org/hadoop/randomwriter. On a client node, the Sort experiment can be run using the following command:

$ bin/hadoop jar hadoop-examples-*.jar sort [-m <#maps>] [-r <#reduces>] <in-dir> <out-dir>

The in-dir of Sort can be the out-dir of RandomWriter. More details can be found at http://wiki.apache.org/hadoop/sort.

5.3 TeraSort

TeraSort is probably the most well-known Hadoop benchmark. It is a benchmark that combines testing of the HDFS and MapReduce layers of a Hadoop cluster. The input data for TeraSort can be generated by the TeraGen tool, which writes the desired number of rows of data in the input directory. By default, the key and value size is fixed for this benchmark at 100 bytes per row. TeraSort takes the data from the input directory and sorts it into another directory. The output of TeraSort can be validated by the TeraValidate tool. Before running the TeraSort benchmark, we can use TeraGen to generate the input data as follows:

$ bin/hadoop jar hadoop-examples-*.jar teragen <number of 100-byte rows> <output dir>

On a client node, the TeraSort experiment can be run using the following command:

$ bin/hadoop jar hadoop-examples-*.jar terasort <input dir> <output dir>

The in-dir of TeraSort can be the out-dir of TeraGen.
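The TeraGen/TeraSort/TeraValidate sequence above can be sketched as one pipeline. Directory names here are hypothetical, and teravalidate is the validation entry of the same Hadoop 1.x examples jar; with DRY_RUN=1 the commands are printed rather than executed. At 100 bytes per row, 10,000,000 rows yield roughly 1 GB of input.

```shell
# Generate, sort, and validate <rows> rows of 100-byte TeraSort records.
# tera-in / tera-out / tera-report are illustrative HDFS directory names.
terasort_pipeline() { # terasort_pipeline <rows>
    rows=$1
    for cmd in \
        "bin/hadoop jar hadoop-examples-*.jar teragen $rows tera-in" \
        "bin/hadoop jar hadoop-examples-*.jar terasort tera-in tera-out" \
        "bin/hadoop jar hadoop-examples-*.jar teravalidate tera-out tera-report"
    do
        if [ "$DRY_RUN" = "1" ]; then echo "$cmd"; else $cmd; fi
    done
}

# Example: DRY_RUN=1; terasort_pipeline 10000000
```

An empty tera-report directory after the teravalidate step indicates the output was correctly sorted.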