Revolution R Enterprise 7 Hadoop Configuration Guide




The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2014. Revolution R Enterprise 7 Hadoop Configuration Guide. Revolution Analytics, Inc., Mountain View, CA.

Copyright 2014 Revolution Analytics, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of Revolution Analytics.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013.

Revolution R, Revolution R Enterprise, RPE, RevoScaleR, DeployR, RevoTreeView, and Revolution Analytics are trademarks of Revolution Analytics. Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective owners.

Revolution Analytics
2570 West El Camino Real, Suite 222
Mountain View, CA 94040
U.S.A.

We want our documentation to be useful, and we want it to address your needs. If you have comments on this or any Revolution document, write to doc@revolutionanalytics.com.

Table of Contents

1 Introduction
  1.1 System Requirements
  1.2 Basic Hadoop Terminology
  1.3 Verifying the Hadoop Installation
  1.4 Adjusting Hadoop Memory Limits (Hadoop 2.x Systems Only)
2 Hadoop Security with Kerberos Authentication
3 Installing Revolution R Enterprise on a Cluster
  3.1 Standard Command Line Install
  3.2 Distributed Installation with RevoMPM
  3.3 Installing the Revolution R Enterprise JAR File
  3.4 Setting Environment Variables for Hadoop
  3.5 Creating Directories for Revolution R Enterprise
  3.6 Installing on a Cloudera Manager System Using a Cloudera Manager Parcel
4 Verifying Installation
5 Troubleshooting the Installation
  5.1 No Valid Credentials
  5.2 Unable to Load Class RevoScaleR
  5.3 Classpath Errors
  5.4 Unable to Load Shared Library
6 Getting Started with Hadoop

1 Introduction

Revolution R Enterprise is a scalable data analytics solution designed to work seamlessly whether your computing environment is a single-user workstation, a local network of connected servers, or a cluster in the cloud. This manual is intended for those who need to configure a Hadoop cluster for use with Revolution R Enterprise.

This manual assumes that you have download instructions for Revolution R Enterprise and related files; if you do not have those instructions, please contact Revolution Analytics Technical Support for assistance.

1.1 System Requirements

Revolution R Enterprise works with the following Hadoop distributions:

- Cloudera CDH4 and CDH5
- Hortonworks HDP 1.3.0, HDP 2.0.0, HDP 2.1.0
- MapR 3.0.2, MapR 3.1.0, MapR 3.1.1

Your cluster installation must include the C APIs contained in the libhdfs package; these are required by Revolution R Enterprise. See your Hadoop documentation for information on installing this package.

The Hadoop distribution must be installed on Red Hat Enterprise Linux 5 or 6, or a fully compatible operating system. Revolution R Enterprise should be installed on all nodes of the cluster.

Revolution R Enterprise requires Hadoop MapReduce and the Hadoop Distributed File System (HDFS) for CDH4, HDP 1.3.0, and MapR 3.x installations, or HDFS, Hadoop YARN, and Hadoop MapReduce2 for CDH5 and HDP 2.x installations. The HDFS, YARN, and MapReduce clients must be installed on all nodes on which you plan to run Revolution R Enterprise, as must Revolution R Enterprise itself.

Minimum system configuration requirements for Revolution R Enterprise are as follows:

Processor: 64-bit CPU with x86-compatible architecture (variously known as AMD64, Intel64, x86-64, IA-32e, EM64T, or x64). Itanium-architecture CPUs (also known as IA-64) are not supported. Multiple-core CPUs are recommended.

Operating System: Red Hat Enterprise Linux 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2, or 6.3. Only 64-bit operating systems are supported.

Memory: A minimum of 4GB of RAM is required; 8GB or more are recommended.

Disk Space: A minimum of 500MB of disk space is required on each node.

1.2 Basic Hadoop Terminology

The following terms apply to computers and services within the Hadoop cluster, and define the roles of hosts within the cluster:

Hadoop 1.x Installations (CDH4, HDP 1.3.0, MapR 3.x)

JobTracker: The Hadoop service that distributes MapReduce tasks to specific nodes in the cluster. The JobTracker queries the NameNode to find the location of the data needed for the tasks, then distributes the tasks to TaskTracker nodes near (or coextensive with) the data. In small clusters, the JobTracker may run on the NameNode, but this is not recommended for production use.

NameNode: The host in the cluster that is the master node of the HDFS file system, managing the directory tree of all files in the file system. In small clusters, the NameNode may host the JobTracker, but this is not recommended for production use.

TaskTracker: Any host that can accept tasks (Map, Reduce, and Shuffle operations) from a JobTracker. TaskTrackers are usually, but not always, also DataNodes, so that tasks assigned to a TaskTracker can work on data stored on the same node.

DataNode: A host that stores data in the Hadoop Distributed File System. DataNodes connect to the NameNode and respond to requests from the NameNode for file system operations.

Hadoop 2.x Installations (CDH5, HDP 2.x)

Resource Manager: The Hadoop service that distributes MapReduce and other Hadoop tasks to specific nodes in the cluster. The Resource Manager takes over the scheduling functions of the old JobTracker, determining which nodes are appropriate for the current job.

NameNode: The host in the cluster that is the master node of the HDFS file system, managing the directory tree of all files in the file system.

Application Master: New in MapReduce2/YARN, the Application Master takes over task progress coordination from the old JobTracker, working with node managers on the individual task nodes. The Application Master negotiates with the Resource Manager for cluster resources, which are allocated as a set of containers, with each container running an application-specific task on a particular node.

NodeManager: Node managers manage the containers allocated for a given task on a given node, coordinating with the Resource Manager and the Application Masters. NodeManagers are usually, but not always, also DataNodes, and most frequently the containers on a given node work with data stored on the same node.

DataNode: A host that stores data in the Hadoop Distributed File System. DataNodes connect to the NameNode and respond to requests from the NameNode for file system operations.

1.3 Verifying the Hadoop Installation

We assume you have already installed Hadoop on your cluster. If not, use the documentation provided with your Hadoop distribution to help you perform the installation; Hadoop installation is complicated and involves many steps. Following the documentation carefully does not guarantee success, but it does make troubleshooting easier. In our testing, we have found the following documents helpful:

- Cloudera CDH4, package install
- Cloudera CDH4, Cloudera Manager parcel install
- Cloudera CDH5, package install
- Cloudera CDH5, Cloudera Manager parcel install
- Hortonworks HDP 1.3
- Hortonworks HDP 2.1
- Hortonworks HDP 1.x or 2.x, Ambari install
- MapR 3.1 (M5 Edition)

If you are using Cloudera Manager, it is important to know whether your installation was via packages or parcels; the Revolution R Enterprise Cloudera Manager parcel can be used only with parcel installs. If you have installed Cloudera Manager via packages, do not attempt to use the RRE Cloudera Manager parcel; use the standard Revolution R Enterprise for Linux installer instead.

It is useful to confirm that Hadoop itself is running correctly before attempting to install Revolution R Enterprise on the cluster. Hadoop comes with example programs, in the jar file hadoop-mapreduce-examples.jar, that you can run to verify that your Hadoop installation is running properly. The following command should display a list of the available examples:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

(On MapR, the quick installation installs the Hadoop files to /opt/mapr by default; the path to the examples jar file is /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar. Similarly, on Cloudera Manager parcel installs, the default path to the examples is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce-examples.jar.)

The following runs the pi example, which uses Monte Carlo sampling to estimate pi; the 5 tells Hadoop to use 5 mappers, and the 300 says to use 300 samples per map:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 5 300

If you can successfully run one or more of the Hadoop examples, your Hadoop installation was successful and you are ready to install Revolution R Enterprise.

1.4 Adjusting Hadoop Memory Limits (Hadoop 2.x Systems Only)

On Hadoop 2.x systems only (CDH5 and HDP 2.x), we have found that the default settings for Map and Reduce memory limits are inadequate for large RevoScaleR jobs. We need to modify four properties in mapred-site.xml and one in yarn-site.xml, as follows:

(in mapred-site.xml)

    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx1229m</value>
    </property>
    <property>
      <name>mapreduce.reduce.java.opts</name>
      <value>-Xmx1229m</value>
    </property>

(in yarn-site.xml)

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>3198</value>
    </property>

If you are using a cluster manager such as Cloudera Manager or Ambari, these settings must usually be modified using the Web interface.

2 Hadoop Security with Kerberos Authentication

By default, most Hadoop configurations are relatively insecure. Security features such as SELinux and iptables firewalls are often turned off to help get the Hadoop cluster up and running quickly. However, the Cloudera and Hortonworks distributions of Hadoop support Kerberos authentication, which allows Hadoop to operate in a much more secure manner. To use Kerberos authentication with your particular version of Hadoop, see one of the following documents:

- Cloudera CDH4
- Cloudera CDH4 with Cloudera Manager 4
- Cloudera CDH5
- Cloudera CDH5 with Cloudera Manager 5
- Hortonworks HDP 1.3
- Hortonworks HDP 2.x
- Hortonworks HDP (1.3 or 2.x) with Ambari

If you have trouble restarting your Hadoop cluster after enabling Kerberos authentication, the problem is most likely with your keytab files. Be sure you have created all the required Kerberos principals and generated appropriate keytab entries for all of your nodes, and that the keytab files have been placed correctly with the appropriate permissions. (We have found that in Hortonworks clusters managed with Ambari, it is important that the spnego.service.keytab file be present on all the nodes of the cluster, not just the name node and secondary name node.)

The MapR distribution also supports Kerberos authentication, but most MapR installations use that distribution's wire-level security feature. See the MapR Security Guide for details.
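If you need to confirm keytab placement across the cluster, the klist utility can inspect a keytab file directly. The sketch below is a minimal check, assuming passwordless ssh, a Hortonworks-style keytab path, and illustrative host names; adjust all three for your environment:

    # Verify the SPNEGO keytab exists and is readable on every node
    # (keytab path and host names are examples only)
    for host in node1 node2 node3; do
        ssh root@$host 'klist -kt /etc/security/keytabs/spnego.service.keytab' \
            || echo "keytab missing or unreadable on $host"
    done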

3 Installing Revolution R Enterprise on a Cluster

It is highly recommended that you install Revolution R Enterprise as root on each node of your Hadoop cluster. This ensures that all users will have access to it by default. Non-root installs are supported, but require that the path to the R executable files be added to each user's path.

3.1 Standard Command Line Install

For most users, installing on the cluster means simply running the standard Revolution R Enterprise installer on each node of the cluster. This can be done most quickly by doing the following on each node:

1. Copy the installer Revo-Ent-7.2.0-RHELn.tar.gz to the node (where n is either 5 or 6, depending on your cluster's operating system).

2. Unpack the installer by issuing the command:

    tar zxf Revo-Ent-7.2.0-RHELn.tar.gz

3. Change directory to the RevolutionR_7.2.0 directory created, and issue the following command:

    ./install.py -n -d -a

This installs Revolution R Enterprise with the standard options.

3.2 Distributed Installation with RevoMPM

If your Hadoop cluster is configured to allow passwordless ssh access among the various nodes, you can use the Revolution Multi-Node Package Manager (RevoMPM) to deploy Revolution R Enterprise across your cluster.

On any one node of the cluster, create a directory for the installer, such as /var/tmp/revo-install, and download the following files to that directory (you can find the links in your welcome e-mail):

- install_mpm.py
- RevoMPM-0.3-4.x86_64.rpm

In that same directory, create an empty file named hosts.cfg and a subdirectory named packages, and download the Revolution R Enterprise installer Revo-Ent-7.2.0-RHEL*.tar.gz to that packages subdirectory.

To ensure ready access to your nodes via RevoMPM, edit the file hosts.cfg to list the nodes in your cluster. For example:

    [groups]
    nodes = ip-10-0-0-132 ip-10-0-0-133 ip-10-0-0-134 ip-10-0-0-135 ip-10-0-0-136

Change directory to the /var/tmp/revo-install directory, and issue the following command:

    python install_mpm.py

This launches a script that will prompt you for the location of your Revolution R Enterprise installer (accept the default), then prompt you either for a hosts.cfg file (accept the default if you have edited it as described above) or to manually specify groups. In the latter case, you will be prompted for a group name (this is just a convenient way of referring to your cluster) and then the names of the hosts in the group (the nodes you want to install to). You can define multiple groups (you can do this in the hosts.cfg file as well). You will also be asked which version of Revolution R Enterprise you want to install. Answer the prompts and RevoMPM will install Revolution R Enterprise on all the requested nodes.

If you are not running as root, you must specify a Revolution installation directory when running install_mpm.py. The directory must be writable by the user running install_mpm.py:

    python install_mpm.py --nonroot /home/ec2-user/revolution

For complete instructions on installing and running RevoMPM, see the RevoMPM User's Guide.

3.3 Installing the Revolution R Enterprise JAR File

Using Revolution R Enterprise in Hadoop requires the presence of the Revolution R Enterprise Java Archive (JAR) file scaleR-hadoop-0.1-SNAPSHOT.jar. This file can be found in the RevolutionR_7.2.0 directory created when the installer tarball is unpacked; it should be installed on each node of your Hadoop cluster in the standard location for Hadoop JAR files, typically /usr/lib/hadoop/lib.

If you are using RevoMPM, you can install the JAR file on all the nodes of your group with the following command (the command must be entered on a single line):

    revompm cmd:'sudo cp /var/tmp/revo-install/RevolutionR_7.2.0/scaleR-hadoop-0.1-SNAPSHOT.jar /usr/lib/hadoop/lib'

Ensure that the file has execute permissions by executing the following command:

    revompm cmd:'sudo chmod a+x /usr/lib/hadoop/lib/scaleR-hadoop-0.1-SNAPSHOT.jar'
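To confirm the JAR landed on every node, a quick remote listing works. This is a minimal sketch, assuming passwordless ssh and illustrative host names:

    # Check that the RevoScaleR JAR is present on each node
    # (host names are examples only)
    for host in node1 node2 node3; do
        ssh $host 'ls -l /usr/lib/hadoop/lib/scaleR-hadoop-0.1-SNAPSHOT.jar' \
            || echo "JAR missing on $host"
    done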

3.4 Setting Environment Variables for Hadoop

The following steps must be performed on each host that will be involved in HDFS-based computations.

1. Install the gethadoopclasspath.py file, available in the RevolutionR_7.2.0 directory, on each host upon which Revolution R Enterprise will be run by logging in as root and downloading the file to root's home directory. Then, execute the following commands:

    cd
    mkdir -p /usr/local/sbin/
    chmod uog+rx /usr/local/sbin/
    cp gethadoopclasspath.py /usr/local/sbin
    chmod uog+rx /usr/local/sbin/gethadoopclasspath.py

2. As the root user, ensure that a Revolution directory under /etc/profile.d exists and has the correct permissions:

    mkdir -p /etc/profile.d/revolution
    chmod uog+rx /etc/profile.d/revolution

3. Place the file bash_profile_additions (again, available in the RevolutionR_7.2.0 directory) into the newly created directory:

    cp bash_profile_additions /etc/profile.d/revolution
    chmod uog+rx /etc/profile.d/revolution/bash_profile_additions

4. Edit the following configuration file:

    /etc/profile.d/revolution/bash_profile_additions

Uncomment the lines that are necessary, as described in the inline comments for each configuration; in particular, make sure that the line that runs gethadoopclasspath.py is uncommented. Modify any host-specific paths as directed by the inline comments.

5. Place the file rhadoop.sh (again, available in the RevolutionR_7.2.0 directory) into the directory /etc/profile.d. Ensure the file is world-readable.

6. Edit the file /etc/profile.d/rhadoop.sh as appropriate for your Hadoop distribution.

7. Each user who will be using Revolution R on the host must add the following line to his or her $HOME/.bash_profile file:

    . /etc/profile.d/revolution/bash_profile_additions

(Note the dot at the beginning of the line; that is part of the command to source the file.)

8. Each user should log out and back in to pick up the environment changes.

The following environment variables should be set when you are done (the easiest way to ensure that this is done consistently is to edit and uncomment the corresponding lines in the bash_profile_additions file):

HADOOP_HOME: This should be set to the directory containing the Hadoop files.
HADOOP_VERSION: This should be set to the current Hadoop version, such as 0.20.2-cdh3u3.
HADOOP_CMD: This should be set to the command used to invoke Hadoop.
PATH: This should be updated to include your Hadoop command and your Java executables.
CLASSPATH: This should be a fully expanded CLASSPATH with access to all required Hadoop JAR files.
JAVA_LIBRARY_PATH: If necessary, this should be set as described in the bash_profile_additions file.

The one environment variable that should NOT be set in bash_profile_additions is the following:

HADOOP_STREAMING: This should be set (in /etc/profile.d/rhadoop.sh) to the path of the Hadoop streaming jar file.

3.5 Creating Directories for Revolution R Enterprise

The /var/revoshare directory should be created for use by Revolution R Enterprise and its users on the Hadoop cluster's native file system. The /user/revoshare directory should be created on the Hadoop Distributed File System. Both should have read, write, and execute permissions for all authorized users, and each user should have a personal user directory beneath each top-level directory. The /tmp and /share directories should also exist with read, write, and execute permissions on HDFS.

The following commands, run from the shell prompt as the hdfs user (as the mapr user on MapR systems), should create the necessary HDFS directories:

    hadoop fs -mkdir /tmp
    hadoop fs -chmod uog+rwx /tmp
    hadoop fs -mkdir /share
    hadoop fs -chmod uog+rwx /share
    hadoop fs -mkdir /user
    hadoop fs -chmod uog+rwx /user
    hadoop fs -mkdir /user/revoshare/
    hadoop fs -chmod uog+rwx /user/revoshare/
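A quick listing confirms the directories exist on HDFS with the expected permission bits (a minimal check; the exact output format varies slightly by distribution):

    # Confirm the HDFS directories and their permissions
    hadoop fs -ls /
    hadoop fs -ls /user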

The root user should then create the following directory in the native file system:

    sudo mkdir -p /var/revoshare
    chmod uog+rwx /var/revoshare

Each user should then ensure that the appropriate user directories exist, and if necessary, create them with the following commands:

    hadoop fs -mkdir /user/revoshare/$USER
    hadoop fs -chmod uog+rwx /user/revoshare/$USER
    mkdir -p /var/revoshare/$USER
    chmod uog+rwx /var/revoshare/$USER

The HDFS directory can also be created in a user's R session (provided the top-level /user/revoshare has the appropriate permissions) using the following RevoScaleR commands (substitute your actual user name for "username"):

    rxHadoopMakeDir("/user/revoshare/username")
    rxHadoopCommand("fs -chmod uog+rwx /user/revoshare/username")

3.6 Installing on a Cloudera Manager System Using a Cloudera Manager Parcel

If you are running a Cloudera Hadoop cluster managed by Cloudera Manager, and if Cloudera itself was installed via a Cloudera Manager parcel, you can use the Revolution R Enterprise Cloudera Manager parcel to install Revolution R Enterprise on all the nodes of your cluster.

Revolution R Enterprise requires several packages that may not be in a default Red Hat Enterprise Linux installation. Run the following yum command as root to install them:

    yum install gcc-gfortran cairo-devel python-devel \
        tk-devel libicu

Once you have installed the Revolution R Enterprise prerequisites, install the Cloudera Manager parcel as follows:

1. Download the Revolution R Enterprise Cloudera Manager parcel using the links provided in your welcome e-mail. (Note that the parcel consists of two files, the parcel itself and its associated .sha file.)
2. Copy the parcel files to your local parcel repository, typically /opt/cloudera/parcel-repo.
3. In your browser, open Cloudera Manager.
4. Click Hosts in the upper navigation bar to bring up the All Hosts page.
5. Click Parcels to bring up the Parcels page.
6. Click Check for New Parcels. RRE 7.2.0-1 should appear with a Distribute button.
7. Click the RRE 7.2.0-1 Distribute button. Revolution R Enterprise will be distributed to all the nodes of your cluster. When the distribution is complete, the Distribute button is replaced with an Activate button.
8. Click Activate. Activation prepares Revolution R Enterprise to be used by the cluster after a restart.
9. Run the following script as root on the node running the Cloudera Manager:

    /opt/cloudera/parcels/RRE/scripts/rre_hdfs_run_once.sh

If you get a "permission denied" error when running the script, make sure that the file is executable; if not, use the chmod command to make it so:

    cd /opt/cloudera/parcels/RRE/scripts
    chmod +x rre_hdfs_run_once.sh

10. Have any users who will be using any of your managed nodes to run Revolution R Enterprise add the following line to their .bash_profile files (the period at the beginning represents the bash source command):

    . /etc/profile.d/revolution/bash_profile_additions

4 Verifying Installation

After completing installation, do the following to verify that Revolution R Enterprise will actually run commands in Hadoop:

1. If the cluster is security-enabled, obtain a ticket using kinit (for Kerberos authentication) or maprlogin password (for MapR wire security).

2. Start Revolution R Enterprise on a cluster node by typing Revo64 at a shell prompt.

3. At the R prompt >, enter the following commands (these commands are drawn from the RevoScaleR Hadoop Getting Started Guide, which explains what each of them does; for now, we are just trying to see if everything works):

    bigDataDirRoot <- "/share"
    myHadoopCluster <- RxHadoopMR()
    rxSetComputeContext(myHadoopCluster)
    source <- system.file("SampleData/AirlineDemoSmall.csv",
                          package="RevoScaleR")
    inputDir <- file.path(bigDataDirRoot, "AirlineDemoSmall")
    rxHadoopMakeDir(inputDir)
    rxHadoopCopyFromLocal(source, inputDir)
    hdfsFS <- RxHdfsFileSystem()
    colInfo <- list(DayOfWeek = list(type = "factor",
        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")))
    airDS <- RxTextData(file = inputDir, missingValueString = "M",
                        colInfo = colInfo, fileSystem = hdfsFS)
    adsSummary <- rxSummary(~ArrDelay + CRSDepTime + DayOfWeek,
                            data = airDS)
    adsSummary

If you installed Revolution R Enterprise in a non-default location, you must specify the location using both the hadoopRPath and revoPath arguments to RxHadoopMR:

    myHadoopCluster <- RxHadoopMR(hadoopRPath="/path/to/Revo64",
                                  revoPath="/path/to/Revo64")
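If Revo64 fails to start, or the test job cannot find Hadoop or the RevoScaleR classes, a quick shell check of the environment often pinpoints the problem. This is a minimal sketch; the grep pattern is illustrative:

    # Confirm Revo64 is on the PATH and the Hadoop classpath was picked up
    which Revo64
    echo $HADOOP_CMD
    echo $CLASSPATH | tr ':' '\n' | grep -i 'scaleR-hadoop'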

If you see the following, congratulations:

    Call:
    rxSummary(formula = ~ArrDelay + CRSDepTime + DayOfWeek, data = airDS)

    Summary Statistics Results for: ~ArrDelay + CRSDepTime + DayOfWeek
    Data: airDS (RxTextData Data Source)
    File name: /share/AirlineDemoSmall
    Number of valid observations: 6e+05

     Name       Mean     StdDev    Min        Max        ValidObs MissingObs
     ArrDelay   11.31794 40.688536 -86.000000 1490.00000 582628   17372
     CRSDepTime 13.48227  4.697566   0.016667   23.98333 600000       0

    Category Counts for DayOfWeek
    Number of categories: 7
    Number of valid observations: 6e+05
    Number of missing observations: 0

     DayOfWeek Counts
     Monday    97975
     Tuesday   77725
     Wednesday 78875
     Thursday  81304
     Friday    82987
     Saturday  86159
     Sunday    94975

Next, try to run a simple rxExec job:

    rxExec(list.files)

That should return a list of files in the native file system. If either the call to rxSummary or the call to rxExec results in an error, see section 5, Troubleshooting the Installation, for a few of the more common errors and how to fix them.

5 Troubleshooting the Installation

No two Hadoop installations are exactly alike, but most are quite similar. This section brings together a number of common errors seen in attempting to run Revolution R Enterprise commands on Hadoop clusters, and the most likely causes of such errors from our experience.

5.1 No Valid Credentials

If you see a message such as "No valid credentials provided", this means you do not have a valid Kerberos ticket. Quit Revolution R Enterprise, obtain a Kerberos ticket using kinit, and then restart Revolution R Enterprise.
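For example, a typical session looks like the following; the principal name is illustrative:

    # Inspect any existing ticket, then obtain a fresh one
    klist
    kinit analyst@EXAMPLE.COM
    klist    # should now show a valid krbtgt ticket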

5.2 Unable to Load Class RevoScaleR

If you see a message about being unable to find or load main class RevoScaleR, this means that the jar file scaleR-hadoop-0.1-SNAPSHOT.jar could not be found. This jar file must be in a location where it can be found by the gethadoopclasspath.py script, or its location must be explicitly added to the CLASSPATH.

5.3 Classpath Errors

If you see other errors related to Java classes, these are likely related to the settings of the following environment variables:

- PATH
- CLASSPATH
- JAVA_LIBRARY_PATH

Of these, the most commonly misconfigured is the CLASSPATH. Ensure that the script gethadoopclasspath.py has execute permission and is being executed when the bash_profile_additions script is sourced.

5.4 Unable to Load Shared Library

If you see a message about being unable to load libhdfs.so, you may need to create a symbolic link from your installed version of libhdfs.so to the system library location, such as the following:

    ln -s /path/to/libhdfs.so /usr/lib64/libhdfs.so

Or, update your LD_LIBRARY_PATH environment variable to include the directory containing the libhdfs shared object:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/libhdfs/directory

Similarly, if you see a message about being unable to load libjvm.so, you may need to create a symbolic link from your installed version of libjvm.so to the system library location, such as the following:

    ln -s /path/to/libjvm.so /usr/lib/libjvm.so

Or, update your LD_LIBRARY_PATH environment variable to include the directory containing the libjvm shared object:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/libjvm/directory
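If you are not sure where these libraries live, find can locate them; the search roots below are common defaults and may differ on your system:

    # Locate candidate copies of the shared libraries
    find /usr /opt -name 'libhdfs.so*' 2>/dev/null
    find /usr -name 'libjvm.so*' 2>/dev/null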

6 Getting Started with Hadoop

To get started with Revolution R Enterprise on Hadoop, we recommend the RevoScaleR 7 Hadoop Getting Started Guide (PDF), which provides a tutorial introduction to using RevoScaleR with Hadoop.