RHadoop and MapR: Accessing Enterprise-Grade Hadoop from R
Version 2.0 (14 March 2014)

Table of Contents

Introduction
Environment
R
Special Installation Notes
Install R
Install RHadoop
    Install rhdfs
    Install rmr2
    Install rhbase
Conclusion
Resources

Introduction

RHadoop is an open source collection of three R packages created by Revolution Analytics that allow users to manage and analyze data with Hadoop from an R environment. It lets data scientists familiar with R use the enterprise-grade capabilities of the MapR Hadoop distribution directly from R's analytic environment. This paper provides step-by-step instructions for installing and using RHadoop with MapR and R on RedHat Enterprise Linux.

RHadoop consists of the following packages:

  rhdfs  - functions providing file management of HDFS from within R
  rmr2   - functions providing Hadoop MapReduce functionality in R
  rhbase - functions providing database management for the HBase distributed database from within R

Each of the RHadoop packages can be installed and used independently or in conjunction with the others.

Environment

The integration testing described in this paper was performed in March 2014 on a 3-node Amazon EC2 cluster. The product versions used in the test are listed in the table below. Note that Revolution Analytics currently provides Linux support only on RedHat.

  Product                              Version
  EC2 AMI                              RHEL-6.4 x86_64 (ami-74557e31)
  EC2 instance type                    m1.large
  Root/boot volume                     8GB EBS standard
  MapR storage                         (3) 450GB EBS standard
  RedHat Enterprise Linux 6.4 64-bit   2.6.32-358.el6.x86_64
  Java                                 java-1.7.0-openjdk.x86_64, java-1.7.0-openjdk-devel.x86_64
  MapR M7                              3.0.2 (3.0.2.22510.GA-1)
  HBase                                0.94.13
  GNU R                                3.0.2
  RHadoop rhdfs                        1.0.8
  RHadoop rmr2                         3.0.1
  RHadoop rhbase                       1.2.0
  Apache Thrift                        0.9.1

R

R is both a language and an environment for statistical computing and is freely available from GNU. Revolution Analytics provides two versions of R: the free Revolution R Community and the premium Revolution R Enterprise for workstations and servers. Revolution R Community is an enhanced distribution of open source R for users looking for faster performance and greater stability. Revolution R Enterprise adds commercial enhancements and professional support for real-world use, bringing higher performance, greater scalability, and stronger reliability to R at a fraction of the cost of legacy products.

R is easily extended with libraries that are distributed in packages. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation. Once installed, a package must be loaded into the session before it can be used.
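As a concrete illustration of that install-then-load cycle, the following minimal sketch installs and loads a package from within R (the stringr package is used here only as an example; the CRAN mirror is the one used throughout this paper):

> # one-time: download and install the package from a CRAN mirror
> install.packages('stringr', repos="http://cran.revolutionanalytics.com")
> # every session: load the package before using its functions
> library(stringr)
> str_detect("mapreduce", "map")   # returns TRUE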

Special Installation Notes

These installation instructions are specific to the product versions listed earlier in this document. Some modifications may be required for your environment. A package repository must be available to install dependent packages. The MapR cluster must be installed and running either Hadoop HBase or MapR tables, and you will need root privileges on all nodes in the cluster. MapR installation instructions are available on the MapR documentation web site: http://doc.mapr.com/display/mapr/quick+installation+guide

All commands entered by the user are shown in courier font. Commands entered from within the R environment are preceded by the default R prompt '>'. Linux shell commands are preceded by either the '#' character (for the root user) or the '$' character (for a non-root user). For example, the following represents running the yum command as the root user:

# yum install git -y

The following example represents running a command within the R shell:

> library(rhdfs)

Similarly, the following represents running a command within the HBase shell:

hbase(main):001:0> create 'mytable', 'cf1'

Note that some Linux commands are long and wrap across multiple lines in this document; they use the backslash "\" character to escape the carriage return. Similarly, long R commands break after a comma "," to escape the carriage return. This facilitates copying from this document and pasting into a Linux terminal window. Here is an example of a long Linux command that wraps to multiple lines:

# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"

And here is an example of a long R command that wraps to multiple lines:

> install.packages(c('Rcpp'),
repos="http://cran.revolutionanalytics.com")

Unless otherwise indicated, all commands should be run as the root user on the client and task tracker systems. This document assumes there is a non-root user called user01 on your client system for purposes of validating the installation. You can use any non-root user that can access the MapR cluster; just remember to replace all occurrences of user01 with your own non-root user. Finally, note that these installation instructions assume your client systems are running RedHat Linux, as that is the only operating system supported by Revolution Analytics.

Install R

Testing for this paper was done with GNU R. If you have Revolution R Community or Enterprise, you can use that version of R instead; follow Revolution Analytics' installation instructions for the appropriate edition. A version of R must always be installed on the client system accessing the cluster through the RHadoop libraries. Additionally, to execute MapReduce jobs with the rmr2 library, R must be installed on all task tracker nodes in the cluster. As the root user, follow the steps below to install GNU R on all client systems and task trackers.

1) Install GNU R.

# yum -y --enablerepo=epel install R

2) Install the GNU R developer package.

# yum -y --enablerepo=epel install R-devel

Note that the R-devel package may already be up to date after installing the R package.

3) Confirm the installation was successful by running R as a non-root user on your client system and on all your task trackers. At the command line, type the following to determine the version of R that is installed.

# su - user01 -c "R --version"

Install RHadoop

The installation instructions that follow are complete for each RHadoop package (rhdfs, rmr2, rhbase), so system administrators can skip to the instructions for just the package(s) they want to install. Recall that R must be installed before installing any of the RHadoop packages.

Install rhdfs

The rhdfs package uses the hadoop command to access MapR file services. To use rhdfs, R and the rhdfs package only need to be installed on the client system that is accessing the cluster. This can be a node in the cluster or any client system that can access the cluster with the hadoop command. As the root user, perform the following steps on every client node.

1) Confirm that you can access the MapR file services by listing the contents of the top-level directory.

# su - user01 -c "hadoop fs -ls /"

2) Install the rJava R package that is required by rhdfs.

# R --save
> install.packages(c('rJava'), repos="http://cran.revolutionanalytics.com")
> quit()

3) Download the rhdfs package from github.

# yum install git -y
# cd ~
# git clone git://github.com/revolutionanalytics/rhdfs.git

4) Set the HADOOP_CMD environment variable to the hadoop command script and install the rhdfs package. Whereas the rJava package in the previous step was downloaded and installed from a CRAN repository, rhdfs is installed from the local git clone with R CMD INSTALL.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
# R CMD INSTALL ~/rhdfs/pkg

5) Set the remaining required environment variables. In addition to the HADOOP_CMD variable set in the previous step, LD_LIBRARY_PATH must include the location of the MapR client library libmaprclient.so, and HADOOP_CONF must specify the Hadoop configuration directory. Any user wanting to use the rhdfs library must set these environment variables.

# export LD_LIBRARY_PATH=/opt/mapr/lib:$LD_LIBRARY_PATH
# export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf

6) Switch user to your non-root user to validate that the installation was successful.

# su user01

7) From R, load the rhdfs library and confirm that you can access the MapR cluster file system by listing the root directory.

$ R --no-save
> library(rhdfs)
> hdfs.init()
> hdfs.ls('/')
> quit()

Note: When loading an R library using the library() command, dependent libraries are also loaded. For rhdfs, the rJava library will be loaded if it has not been already.

8) Exit from your su command.

$ exit

9) Check the installation of the rhdfs package.

# R CMD check ~/rhdfs/pkg; echo $?

An exit code of 0 means the installation was successful. If no errors are reported, you have successfully installed rhdfs and can use it to access the MapR file services from R (you can safely ignore any "notes").

10) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" \
>> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" \
>> /etc/profile
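With rhdfs installed, routine file operations can be scripted from R. The following is a minimal sketch, run as a non-root user with the environment variables above set; the /tmp/rhdfs-demo path and the local file are arbitrary illustrations, and hdfs.mkdir, hdfs.put, hdfs.ls, and hdfs.delete are rhdfs functions (check the package help for exact signatures):

$ R --no-save
> library(rhdfs)
> hdfs.init()                                 # connect using HADOOP_CMD
> hdfs.mkdir('/tmp/rhdfs-demo')               # create a directory in MapR-FS
> hdfs.put('/etc/hosts', '/tmp/rhdfs-demo')   # copy a local file into the cluster
> hdfs.ls('/tmp/rhdfs-demo')                  # list it, as 'hadoop fs -ls' would
> hdfs.delete('/tmp/rhdfs-demo')              # remove the demo directory
> quit()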

Install rmr2

The rmr2 package uses Hadoop Streaming to invoke R map and reduce functions. To use rmr2, the R and rmr2 packages must be installed on the clients as well as on every task tracker node in the MapR cluster. As the root user, install rmr2 on every task tracker node AND on every client system.

1) Install the required R packages.

# R --no-save
> install.packages(c('Rcpp'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('RJSONIO'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('itertools'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('digest'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('functional'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('stringr'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('plyr'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('bitops'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('reshape2'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('caTools'), repos="http://cran.revolutionanalytics.com")
> quit()

Note: Warnings may be safely ignored while the packages are being built.

2) Download the quickcheck and rmr2 packages from github. RHadoop includes the quickcheck package to support the randomized unit tests performed by the rmr2 package check.

# cd ~
# git clone git://github.com/revolutionanalytics/quickcheck.git
# git clone git://github.com/revolutionanalytics/rmr2.git

3) Install the quickcheck and rmr2 packages.

# R CMD INSTALL ~/quickcheck/pkg
# R CMD INSTALL ~/rmr2/pkg

Note: Warnings may be safely ignored while the packages are being built.

4) From any task tracker node, create a directory called /tmp (if it doesn't already exist) in the root of your MapR-FS (owned by the mapr user) and give it global read-write permissions. This directory is required for running MapReduce applications in R using the rmr2 package.

# su - mapr -c "hadoop fs -mkdir /tmp"

Note: the command above will fail gracefully if the /tmp directory already exists.

# su - mapr -c "hadoop fs -chmod 777 /tmp"
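Before validating against the cluster, you can sanity-check an rmr2 installation on a single machine with its local backend, which runs the MapReduce logic in-process and needs no Hadoop services. A minimal sketch (rmr.options, to.dfs, mapreduce, keyval, and from.dfs are rmr2 functions; the doubling job is only an illustration):

$ R --no-save
> library(rmr2)
> rmr.options(backend = "local")   # run in-process; no cluster required
> ints <- to.dfs(1:10)             # write sample data to a temporary DFS file
> job <- mapreduce(input = ints,
    map = function(k, v) keyval(v, 2 * v))
> from.dfs(job)                    # read the key-value pairs back into R
> quit()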

Validate rmr2 on any single client system as a non-root user with the following steps. Note that the rmr2 package must be installed on ALL the task trackers before proceeding.

1) Confirm that your MapR cluster is configured to run a simple MapReduce job outside of the RHadoop environment. You only need to perform this step on one client from which you intend to run your R MapReduce programs.

# su - user01 -c "hadoop fs -mkdir /tmp/mrtest/wc-in"
# su - user01 -c "hadoop fs -put /opt/mapr/notice.txt /tmp/mrtest/wc-in"
# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"
# su - user01 -c "hadoop fs -cat /tmp/mrtest/wc-out/part-r-00000"
# su - user01 -c "hadoop fs -rmr /tmp/mrtest"

2) Copy the wordcount.r script to a directory that is accessible by your non-root user.

# cp ~/rmr2/pkg/tests/wordcount.r /tmp

3) Set the required environment variables HADOOP_CMD and HADOOP_STREAMING. Since rmr2 uses Hadoop Streaming, it needs access to both the hadoop command and the streaming jar.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
# export HADOOP_STREAMING=/opt/mapr/hadoop/hadoop-\
0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar

4) Switch user to your non-root user to validate that the installation was successful.

# su user01

5) Run the wordcount.r program from the R environment as your non-root user.

$ R --no-save < /tmp/wordcount.r; echo $?

Note that an exit code of 0 means the command was successful.

6) Exit from your su command.

$ exit

7) Run the full rmr2 check. The examples run by the rmr2 check sequentially generate 81 streaming MapReduce jobs on the cluster. Eighty of the jobs have just 2 mappers, so a large cluster will not speed this up; on a 3-node medium EC2 cluster with two task trackers, the examples take just over 1 hour. You may wish to launch the check under nohup as shown below and wait for it to complete before proceeding with this document.

# nohup R CMD check ~/rmr2/pkg > ~/rmr2-check.out &

Check the output in ~/rmr2-check.out for any errors.

8) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" \
>> /etc/profile
# echo "export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf" \
>> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" \
>> /etc/profile
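To see what an R MapReduce program looks like beyond the bundled tests, here is a hedged sketch of a wordcount written against the rmr2 API, run with HADOOP_CMD and HADOOP_STREAMING set as above (mapreduce, keyval, and the "text" input format are part of rmr2; the input path is an illustration, and this is not the bundled wordcount.r verbatim):

> library(rmr2)
> wc.map <- function(., lines) {
    keyval(unlist(strsplit(lines, split = " ")), 1)   # emit (word, 1) per token
  }
> wc.reduce <- function(word, counts) {
    keyval(word, sum(counts))                         # total count per word
  }
> out <- mapreduce(input = '/tmp/mrtest/wc-in',
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)
> from.dfs(out)                                       # inspect the results in R

Because the reduce function simply sums counts, it is associative and can double as the combiner (combine = TRUE), cutting the data shuffled between map and reduce.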

Install rhbase

The rhbase package accesses HBase via the HBase Thrift server, which is included in the MapR HBase distribution. The rhbase package is a Thrift client that sends requests to, and receives responses from, the Thrift server; the Thrift server listens for Thrift requests and in turn uses the HBase HTable java class to access HBase. Since rhbase is a client-side technology, it only needs to be installed on the client system that will access the MapR HBase cluster. Any MapR HBase cluster node can also be a client.

For the client system to access a local Thrift server, it must have the mapr-hbase-internal package installed, which includes the MapR HBase Thrift server. If your client system is one of the MapR HBase Masters or Region Servers, it will already have this package installed. These rhbase installation instructions assume that mapr-hbase-internal is already installed on the client system.

In addition to the HBase Thrift server, the rhbase package requires the Thrift include files at compile time and the C++ Thrift library at runtime in order to act as a Thrift client. Since these Thrift components are not included in the MapR distribution, Thrift must be installed before rhbase.

By default, rhbase connects to a Thrift server on the local host. A remote server can be specified in the rhbase hb.init() call, but the rhbase package check expects the Thrift server to be local. These installation instructions assume the Thrift server is running locally and that HBase is installed and running in your cluster.

As the root user, perform the following installation steps on ALL task tracker nodes and client systems.

1) Install (or update) the prerequisite packages.

# yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel \
libevent-devel zlib-devel python-devel openssl-devel ruby-devel qt \
qt-devel php-devel

2) Download, build, and install Thrift.

# cd ~
# git clone https://git-wip-us.apache.org/repos/asf/thrift.git thrift
# cd ~/thrift
# sed -i 's/2.65/2.63/' configure.ac
# ./bootstrap.sh
# ./configure
# make && make install
# /sbin/ldconfig /usr/lib/libthrift-0.9.1.so

3) Download the rhbase package.

# cd ~
# git clone git://github.com/revolutionanalytics/rhbase.git

4) Modify the thrift.pc file to append the thrift directory to the includedir configuration.

# sed -i \
's/^includedir=\${prefix}\/include/includedir=\${prefix}\/include\/thrift/' \
~/thrift/lib/cpp/thrift.pc

5) Install the rhbase package. LD_LIBRARY_PATH must be set so the Thrift library (libthrift.so) installed above can be found, and PKG_CONFIG_PATH must point to the directory containing the thrift.pc package configuration file.

# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=~/thrift/lib/cpp
# R CMD INSTALL ~/rhbase/pkg

6) Configure the file /opt/mapr/hbase/hbase-0.94.13/conf/hbase-site.xml with the HBase zookeeper servers and their port number. For HBase Master Servers and HBase Region Servers, the zookeeper servers should already be properly configured in this file. For client-only systems, edit the hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort properties to correspond to your zookeeper servers.

# su - mapr -c "maprcli node listzookeepers"
# su - mapr -c "vi /opt/mapr/hbase/hbase-0.94.13/conf/hbase-site.xml"
...
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zkhost1,zkhost2,zkhost3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
</property>
...

7) Start the MapR HBase Thrift server as a background daemon.

# /opt/mapr/hbase/hbase-0.94.13/bin/hbase-daemon.sh start thrift

Note: Pass the parameters stop thrift to the hbase-daemon.sh script to stop the daemon.

8) Run the rhbase package checks. LD_LIBRARY_PATH and PKG_CONFIG_PATH must still be set. Note that the command will produce errors you can safely ignore if you are not running Hadoop HBase in your MapR cluster; recall that the rhbase package is also compatible with MapR tables, which do not require Hadoop HBase to be installed and running.

# R CMD check ~/rhbase/pkg

Note: When running the rhbase package check, warnings can be safely ignored.

9) (optional) Persist the required environment variable settings for the shells in your environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib64/R/library/\
rhbase/libs:\$LD_LIBRARY_PATH" >> /etc/profile
# echo "export PKG_CONFIG_PATH=~root/thrift/lib/cpp" >> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" \
>> /etc/profile

Test rhbase with Hadoop HBase

The rhbase package is now installed and ready for use by any user on the system. Validate that a non-root user can access HBase via the HBase shell and from rhbase. The instructions below assume the environment variables LD_LIBRARY_PATH and HADOOP_CMD have been configured for the user01 user, and that Hadoop HBase is running in your MapR cluster.

1) Start the HBase shell, create a table, display its description, disable it, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create 'mytable', 'cf1'
hbase(main):002:0> describe 'mytable'
hbase(main):003:0> disable 'mytable'
hbase(main):004:0> drop 'mytable'
hbase(main):005:0> quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('mytable', 'cf1')
> hb.describe.table('mytable')
> hb.disable.table('mytable')
> hb.delete.table('mytable')
> q()
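Beyond creating and dropping tables, rhbase can also write and read individual cells. Here is a hedged sketch of that workflow (hb.insert and hb.get are rhbase functions, but confirm their exact argument structure against the package help; the row, column, and value are illustrations):

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('mytable', 'cf1')
> # write rowkey 'row1' with value 'hello' in column 'cf1:col1'
> hb.insert('mytable', list(list('row1', c('cf1:col1'), list('hello'))))
> hb.get('mytable', 'row1')        # read the row back
> hb.disable.table('mytable')
> hb.delete.table('mytable')
> q()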

Test rhbase with MapR Tables

The rhbase package is also compatible with MapR tables, which you can use if you have an M7 license installed on your MapR cluster. Simply use absolute paths in MapR-FS for your table names rather than the relative names used with HBase. The following steps assume that the user01 home directory is in a MapR file system at /mapr/mycluster/home/user01.

1) Start the HBase shell, create a table, display its description, disable it, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create '/mapr/mycluster/home/user01/mytable', 'cf1'
hbase(main):002:0> describe '/mapr/mycluster/home/user01/mytable'
hbase(main):003:0> disable '/mapr/mycluster/home/user01/mytable'
hbase(main):004:0> drop '/mapr/mycluster/home/user01/mytable'
hbase(main):005:0> quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('/mapr/mycluster/home/user01/mytable', 'cf1')
> hb.describe.table('/mapr/mycluster/home/user01/mytable')
> hb.delete.table('/mapr/mycluster/home/user01/mytable')
> q()
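As noted in the rhbase installation section, hb.init() connects to a Thrift server on the local host by default. If your Thrift server runs on another node, the host and port can be passed explicitly. A sketch, where the hostname is an illustration and 9090 is the conventional HBase Thrift port (confirm the argument names against the rhbase documentation):

> library(rhbase)
> hb.init(host = 'thriftnode.example.com', port = 9090)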

Conclusion

With the Revolution Analytics RHadoop packages and MapR's enterprise-grade Hadoop distribution, data scientists can utilize the full potential of Hadoop from the familiar R environment.

Resources

More information on RHadoop and the other technologies referenced in this paper can be found at the links below.

GNU R home page: http://www.r-project.org
RHadoop home page: https://github.com/revolutionanalytics/
Apache Thrift: http://thrift.apache.org
Revolution Analytics: http://www.revolutionanalytics.com
MapR Technologies: http://www.mapr.com

About MapR Technologies

MapR's advanced distribution for Apache Hadoop delivers on the promise of Hadoop, making the management and analysis of big data a practical reality for more organizations. MapR's advanced capabilities, such as streaming analytics, mission-critical data protection, and MapR tables, expand the breadth and depth of use cases across industries.

About Revolution Analytics

Revolution Analytics (formerly Revolution Computing) was founded in 2007 to foster the R community, as well as support the growing needs of commercial users. Our name derives from combining the letter "R" with the word "evolution." It speaks to the ongoing development of the R language from an open-source academic research tool into commercial applications for industrial use.