Gluster Filesystem 3.3 Beta 2 Hadoop Compatible Storage



Similar documents
October Gluster Virtual Storage Appliance User Guide

Release: August Gluster Filesystem Unified File and Object Storage Beta 2

An Introduction To Gluster. Ryan Matteson

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

EMC NetWorker Module for Microsoft Applications Release 2.3. Application Guide P/N REV A02

How To Install Hadoop From Apa Hadoop To (Hadoop)

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

HDFS Users Guide. Table of contents

Load Balancing and High availability using CTDB + DNS round robin

EMC Avamar. Backup Clients User Guide. Version REV 02

Chapter 7. Using Hadoop Cluster and MapReduce

Introduction to HDFS. Prasanth Kothuri, CERN

Apache Hadoop. Alexandru Costan

Accelerating and Simplifying Apache

Introduction to Gluster. Versions 3.0.x

L1: Introduction to Hadoop

Release Notes for McAfee(R) VirusScan(R) Enterprise for Linux Version Copyright (C) 2014 McAfee, Inc. All Rights Reserved.

EMC NetWorker Module for Microsoft Exchange Server Release 5.1

Introduction to Highly Available NFS Server on scale out storage systems based on GlusterFS

vsphere Upgrade vsphere 6.0 EN

Installation Guide. McAfee VirusScan Enterprise for Linux Software

Postgres Enterprise Manager Installation Guide

THE HADOOP DISTRIBUTED FILE SYSTEM

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Apache Hadoop new way for the company to store and analyze big data

EMC Avamar 7.2 for IBM DB2

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Red Hat Storage Server Administration Deep Dive

HADOOP MOCK TEST HADOOP MOCK TEST I

Postgres Plus xdb Replication Server with Multi-Master User s Guide

How To Use A Microsoft Networker Module For Windows (Windows) And Windows 8 (Windows 8) (Windows 7) (For Windows) (Powerbook) (Msa) (Program) (Network

QuickBooks Enterprise Solutions. Linux Database Server Manager Installation and Configuration Guide

NFS Ganesha and Clustered NAS on Distributed Storage System, GlusterFS. Soumya Koduri Meghana Madhusudhan Red Hat

HADOOP MOCK TEST HADOOP MOCK TEST II

Supported Platforms HPE Vertica Analytic Database. Software Version: 7.2.x

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

HDFS Cluster Installation Automation for TupleWare

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

McAfee Asset Manager Console

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

EMC NetWorker Module for Microsoft for Windows Bare Metal Recovery Solution

Spectrum Scale HDFS Transparency Guide

Introduction to HDFS. Prasanth Kothuri, CERN

TP1: Getting Started with Hadoop

VMware Data Recovery. Administrator's Guide EN

Hadoop Basics with InfoSphere BigInsights

Intuit QuickBooks Enterprise Solutions. Linux Database Server Manager Installation and Configuration Guide

CSE-E5430 Scalable Cloud Computing Lecture 2

Installing Management Applications on VNX for File

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney

VMware vsphere Big Data Extensions Administrator's and User's Guide

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Hadoop Setup Walkthrough

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Tutorial for Assignment 2.0

Hadoop IST 734 SS CHUNG

Planning and Administering Windows Server 2008 Servers

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

In Memory Accelerator for MongoDB

How To Use Hadoop

Hadoop Multi-node Cluster Installation on Centos6.6

Setting up VMware Server v1 for 2X VirtualDesktopServer Manual

Hadoop Architecture. Part 1

GlusterFS Distributed Replicated Parallel File System

Database Replication Error in Cisco Unified Communication Manager

EMC Documentum Content Services for SAP Repository Manager

Tutorial for Assignment 2.0

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Configuring Nex-Gen Web Load Balancer

File S1: Supplementary Information of CloudDOE

Using Actian PSQL as a Data Store with VMware vfabric SQLFire. Actian PSQL White Paper May 2013

DameWare Server. Administrator Guide

External Data Connector (EMC Networker)

Single Node Hadoop Cluster Setup

EMC Data Protection Search


Oracle WebLogic Server 11g Administration

XtreemFS Extreme cloud file system?! Udo Seidel

Assignment # 1 (Cloud Computing Security)

Testing of several distributed file-system (HadoopFS, CEPH and GlusterFS) for supporting the HEP experiments analisys. Giacinto DONVITO INFN-Bari

Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0

ovirt and Gluster Hyperconvergence

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

Hadoop Distributed File System Propagation Adapter for Nimbus

McAfee SMC Installation Guide 5.7. Security Management Center

EMC NetWorker VSS Client for Microsoft Windows Server 2003 First Edition

PARALLELS SERVER BARE METAL 5.0 README

Using Oracle NoSQL Database

HADOOP CLUSTER SETUP GUIDE:

vsphere Upgrade Update 1 ESXi 6.0 vcenter Server 6.0 EN

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

Administrator s Guide

INUVIKA TECHNICAL GUIDE

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

Transcription:

Gluster Filesystem 3.3 Beta 2 Hadoop Compatible Storage Release: August 2011

Copyright Copyright 2011 Gluster, Inc. This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein. Hadoop Compatible Storage Pg No. 2

Table of Contents 1. About this Guide... 4 1.1. Disclaimer... 4 1.2. Audience... 4 1.3. Prerequisite... 4 1.4. Terms... 4 1.5. Typographical Conventions... 5 1.6. Feedback... 5 2. Introducing Hadoop Compatible Storage of GlusterFS... 6 2.1. Architecture Overview... 6 2.2. Advantages... 6 3. Preparing to Install Hadoop Compatible Storage... 7 3.1. Pre-requisites... 7 3.2. Dependencies... 7 4. Installing and Configuring Hadoop Compatible Storage... 8 5. Starting and Stopping the Hadoop MapReduce Daemon on GlusterFS... 11 5.1. Starting and Stopping MapReduce Daemon... 11 6. Troubleshooting Hadoop Compatible Storage... 12 6.1. Time Sync... 12 6.2. Socket Creation Errors... 12 7. Creating GlusterFS Volumes... 13 7.1. Creating Distributed Striped Replicated Volumes... 13 7.2. Creating Striped Replicated Volumes... 14 8. Managing Your Gluster Filesystem... 15 Hadoop Compatible Storage Pg No. 3

1. About this Guide This guide describes Gluster Hadoop Compatible Storage feature and its installation and management. 1.1. Disclaimer Gluster, Inc. has designated English as the official language for all of its product documentation and other documentation, as well as all our customer communications. All documentation prepared or delivered by Gluster will be written, interpreted and applied in English, and English is the official and controlling language for all our documents, agreements, instruments, notices, disclosures and communications, in any form, electronic or otherwise (collectively, the Gluster Documents ). Any customer, vendor, partner or other party who requires a translation of any of the Gluster Documents is responsible for preparing or obtaining such translation, including associated costs. However, regardless of any such translation, the English language version of any of the Gluster Documents prepared or delivered by Gluster shall control for any interpretation, enforcement, application or resolution. 1.2. Audience This guide is intended for Apache Hadoop users interested in using GlusterFS as filesystem for Hadoop. 1.3. Prerequisite This document assumes that you are familiar with the Linux operating system, concepts of File System, GlusterFS concepts, Apache Hadoop, and MapReduce framework. 1.4. Terms Term master slave job map reduce Description Master manages scheduling of jobs, assigns tasks to slaves, monitors tasks and re-executes the failed tasks. Program which submits a job to the master. A set of map and/or reduce tasks, coordinated by the master. When the master receives a job, it assigns a unique name for the job, and assigns the tasks to workers until they are all completed. The first phase of a job, in which tasks are usually scheduled on the same node where their input data is hosted, so that local computation can be performed. Generally there is one map task per input. Individual task in this phase, which usually has access to all values for a given key produced by the map phase. Hadoop Compatible Storage Pg No. 4

Term Description mapreduce task worker A paradigm and associated framework for distributed computing, which decouples application code from the core challenges of fault tolerance and data locality. A task is essentially a unit of work, provided to a worker. A worker is responsible for carrying out a task. A job specifies the executable that is the worker. Workers are scheduled to run on the nodes, close to the data they are supposed to be processing. 1.5. Typographical Conventions The following table lists the formatting conventions that are used in this guide to make it easier for you to recognize and use specific types of information. Convention Description Example Courier Text Italicized Text Commands formatted as courier indicate shell commands. Within a command, italicized text represents variables, which must be substituted with specific values. gluster volume start volname gluster volume start volname Square Brackets Curly Brackets Within a command, optional parameters are shown in square brackets. Within a command, alternative parameters are grouped within curly brackets and separated by the vertical OR bar. gluster volume start volname [force] gluster volume { start stop delete } volname 1.6. Feedback Gluster welcomes your comments and suggestions on the quality and usefulness of its documentation. If you find any errors or have any other suggestions, write to us at docfeedback@gluster.com for clarification and provide the chapter, section, and page number, if available. Gluster offers a range of resources related to Gluster software: Discuss technical problems and solutions on the Discussion Forum (http://community.gluster.org) Get hands-on step-by-step tutorials (http://www.gluster.com/community/documentation/index.php/main_page) Hadoop Compatible Storage Pg No. 5

2. Introducing Hadoop Compatible Storage of GlusterFS GlusterFS 3.3 beta 2 includes compatibility for Apache Hadoop and it uses the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. Existing MapReduce based applications can use GlusterFS seamlessly. This new functionality opens up data within Hadoop deployments to any file-based or object-based application. A MapReduce framework typically divides the input data-set into independent tasks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the jobs are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and reexecuting the failed tasks. 2.1. Architecture Overview The following diagram illustrates Hadoop integration with Gluster: 2.2. Advantages The following are the advantages of Hadoop Compatible Storage with GlusterFS: Provides simultaneous file-based and object-based access within Hadoop. Eliminates the centralized metadata server. Provides compatibility with MapReduce applications and rewrite is not required. Provides a fault tolerant filesystem. Hadoop Compatible Storage Pg No. 6

3. Preparing to Install Hadoop Compatible Storage This section provides information on pre-requisites and list of dependencies that will be installed during installation of Hadoop compatible storage. 3.1. Pre-requisites The following are the pre-requisites to install and configure GlusterFS with Hadoop Compatible Storage: Hadoop 0.20.2 is installed, configured, and is running on all the machines in the cluster. Java Runtime Environment Maven (mandatory only if you are building the plugin from the source) JDK (mandatory only if you are building the plugin from the source) Source code is available at https://github.com/gluster/hadoop-glusterfs. 3.2. Dependencies The following package will be installed when you install Hadoop Compatible Storage on Gluster: getfattr Hadoop Compatible Storage Pg No. 7

4. Installing and Configuring Hadoop Compatible Storage This section describes how to install and configure Hadoop Compatible Storage in your storage environment and verify that it is functioning correctly. 1. Download the GlusterFS RPM files on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/. 2. For each RPM file, get the md5sum (using the following command) and compare it against the md5sum file available at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3- beta-2/centos/. $ md5sum RPM_file.rpm 3. Install GlusterFS on all servers using the following commands: # rpm -Uvh core_rpm_file # rpm -Uvh fuse_rpm_file # rpm -Uvh geo-replication_rpm_file For example: # rpm -Uvh glusterfs-core-3.3beta2-1.x86_64.rpm # rpm -Uvh glusterfs-fuse-3.3beta2-1.x86_64.rpm # rpm -Uvh glusterfs-geo-replication-3.3beta2-1.x86_64.rpm 4. Verify that 3.3beta2 version of GlusterFS is installed, using the following command: # glusterfs version For more information on installing GlusterFS, refer to GlusterFS Installation at http://www.gluster.com/community/documentation/index.php/gluster_3.2_filesystem_installa tion_guide 5. Download the glusterfs-hadoop-0.20.2-0.1.x86_64.rpm on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3- beta-2/. 6. To install Hadoop Compatible Storage on all servers in your cluster, run the following command: # rpm ivh --nodpes glusterfs-hadoop-0.20.2-0.1.x86_64.rpm The following files will be extracted: /usr/local/lib/glusterfs-<hadoop-version>-<gluster-plugin-version>.jar - /usr/local/lib/conf/core-site.xml 7. (Optional) To install Hadoop Compatible Storage in a different location, run the following command: # rpm ivh --nodpes prefix /usr/local/glusterfs/hadoop glusterfs-hadoop- 0.20.2-0.1.x86_64.rpm Hadoop Compatible Storage Pg No. 8

8. Edit the conf/core-site.xml file. The following is the sample conf/core-site.xml file: <configuration> <property> <name>fs.glusterfs.impl</name> <value>org.apache.hadoop.fs.glusterfs.glusterfilesystem</value> </property> <property> <name>fs.default.name</name> <value>glusterfs://fedora1:9000</value> </property> <property> <name>fs.glusterfs.volname</name> <value>hadoopvol</value> </property> <property> <name>fs.glusterfs.mount</name> <value>/mnt/glusterfs</value> </property> <property> <name>fs.glusterfs.server</name> <value> fedora2</value> </property> <property> <name>quick.slave.io</name> <value>off</value> </property> </configuration> The following are the configurable fields: Property Name Default Value Description fs.default.name glusterfs://fedora1:9000 Any hostname in the cluster as the server and any port number. fs.glusterfs.volname hadoopvol GlusterFS volume to mount. fs.glusterfs.mount /mnt/glusterfs The directory used to fuse mount the volume. fs.glusterfs.server fedora2 Any hostname or IP address on the cluster except the client/master. Hadoop Compatible Storage Pg No. 9

Property Name Default Value Description quick.slave.io Off Performance tunable option. If this option is set to On, the plugin will try to perform I/O directly from the disk filesystem (like ext3 or ext4) the file resides on. Hence read performance will improve and job would run faster. Note: This option is not tested widely. 9. Create a soft link in Hadoop s library and configuration directory for the downloaded files (in Step 7) using the following commands: # ln -s <target location> <source location> For example, # ln s /usr/local/lib/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib/glusterfs-0.20.2-0.1.jar # ln s /usr/local/lib/conf/core-site.xml $HADOOP_HOME/conf/core-site.xml 10. (Optional) You can run the following command on Hadoop master to build the plugin and deploy it along with core-site.xml file, instead of repeating the above steps: # build-deploy-jar.py -d $HADOOP_HOME -c Hadoop Compatible Storage Pg No. 10

5. Starting and Stopping the Hadoop MapReduce Daemon on GlusterFS The MapReduce daemon serves to run MapReduce jobs on Gluster. Note: You must start Hadoop MapReduce daemon on all servers. 5.1. Starting and Stopping MapReduce Daemon To start MapReduce daemon manually, enter the following command: # $HADOOP_HOME/bin/start-mapred.sh To stop MapReduce daemon manually, enter the following command: # $HADOOP_HOME/bin/stop-mapred.sh Hadoop Compatible Storage Pg No. 11

6. Troubleshooting Hadoop Compatible Storage This section describes the most common troubleshooting issues related to Hadoop Compatible Storage. 6.1. Time Sync Running MapReduce job may throw exceptions if the time is out-of-sync on the hosts in the cluster. Solution: Sync the time on all hosts using ntpd program. 6.2. Socket Creation Errors The CLI commands may not work with Centos 5.x, gluster ipv6 module and the socket creation may fail (error message will be logged in the log file). Solution: Edit the following line in /etc/modprobe.conf file from options ipv6 disable=1 to options ipv6 disable=0 and reboot the machines. Hadoop Compatible Storage Pg No. 12

7. Creating GlusterFS Volumes From GlusterFS 3.3 beta 2 onwards, you can create volumes of the following types in your storage environment: Distributed Striped Replicated Distributes and stripes data across replicated bricks in the volume. For more information, see Creating Distributed Striped Replicated Volumes. Striped Replicated Stripes and replicates data across bricks in the volume. For more information, see Creating Striped Replicated Volumes 7.1. Creating Distributed Striped Replicated Volumes Distributed striped replicated volumes distributes and stripes data across replicated bricks in the cluster. For best results, you should use distributed striped replicated volumes where the requirement is to scale storage, high concurrency environments accessing very large files, and performance is critical. To configure a distributed striped replicated volume 1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating trusted storage pool, see http://www.gluster.com/community/documentation/index.php/gluster_3.2:_adding_servers_to _Trusted_Storage_Pool. 2. Create the volume using the following command: Note: The number of bricks should be a multiples of number of stripe count and replica count for a distributed striped replicated volume. # gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp rdma tcp,rdma] NEW-BRICK... To create a distributed replicated striped volume across eight storage servers: # gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6 server7:/exp7 server8:/exp8 Creation of test-volume has been successful Please start the volume to access data. (Optional) Set additional options if required, such as auth.allow or auth.reject. For example: # gluster volume set test-volume auth.allow 10.* Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang. For information on starting volumes, see http://www.gluster.com/community/documentation/index.php/gluster_3.2:_starting_volumes. Hadoop Compatible Storage Pg No. 13

7.2. Creating Striped Replicated Volumes Stripes data across replicated bricks in the cluster. For best results, you should use striped replicated volumes where the requirement is high concurrency environments accessing very large files and performance is critical. To configure a striped replicated volume 1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating trusted storage pool, see http://www.gluster.com/community/documentation/index.php/gluster_3.2:_adding_servers_to _Trusted_Storage_Pool. 2. Create the volume using the following command: Note: The number of bricks should be a multiple of the replicate count and stripe count for a striped replicated volume. # gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp rdma tcp,rdma] NEW-BRICK... To create a striped replicated volume across four storage servers: # gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 Creation of test-volume has been successful Please start the volume to access data. To create a striped replicated volume across six storage servers: # gluster volume create test-volume stripe 3 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6 Creation of test-volume has been successful Please start the volume to access data. 3. (Optional) Set additional options if required, such as auth.allow or auth.reject. For example: # gluster volume set test-volume auth.allow 10.* Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang. For information on starting volumes, see http://www.gluster.com/community/documentation/index.php/gluster_3.2:_starting_volumes. Hadoop Compatible Storage Pg No. 14

8. Managing Your Gluster Filesystem The GlusterFS Administration Guide is available at: http://www.gluster.com/community/documentation/index.php/gluster_3.2_filesystem_administrat ion_guide. Hadoop Compatible Storage Pg No. 15