RHadoop and MapR: Accessing Enterprise-Grade Hadoop from R
Version 2.0 (14 March 2014)

Table of Contents

Introduction
Environment
R
Special Installation Notes
Install R
Install RHadoop
    Install rhdfs
    Install rmr2
    Install rhbase
Conclusion
Resources

Introduction

RHadoop is an open source collection of three R packages created by Revolution Analytics that allow users to manage and analyze data with Hadoop from an R environment. It lets data scientists familiar with R use the enterprise-grade capabilities of the MapR Hadoop distribution directly from R's analytic environment. This paper provides step-by-step instructions for installing and using RHadoop with MapR and R on RedHat Enterprise Linux.

RHadoop consists of the following packages:

  rhdfs  - functions providing file management of HDFS from within R
  rmr2   - functions providing Hadoop MapReduce functionality in R
  rhbase - functions providing database management for the HBase distributed database from within R

Each of the RHadoop packages can be installed and used independently or in conjunction with the others.

Environment

The integration testing described in this paper was performed in March 2014 on a 3-node Amazon EC2 cluster. The product versions used in the test are listed in the table below. Note that Revolution Analytics currently provides Linux support only on RedHat.

  Product                              Version
  EC2 AMI                              RHEL-6.4 x86_64 (ami-74557e31)
  EC2 instance type                    m1.large
  Root/boot volume                     8GB EBS standard
  MapR storage                         (3) 450GB EBS standard
  RedHat Enterprise Linux 6.4 64-bit   2.6.32-358.el6.x86_64
  Java                                 java-1.7.0-openjdk.x86_64, java-1.7.0-openjdk-devel.x86_64
  MapR M7                              3.0.2 (3.0.2.22510.GA-1)
  HBase                                0.94.13
  GNU R                                3.0.2
  RHadoop rhdfs                        1.0.8
  RHadoop rmr2                         3.0.1
  RHadoop rhbase                       1.2.0
  Apache Thrift                        0.9.1

R

R is both a language and an environment for statistical computing and is freely available from GNU. Revolution Analytics provides two versions of R: the free Revolution R Community and the premium Revolution R Enterprise for workstations and servers. Revolution R Community is an enhanced distribution of open source R for users looking for faster performance and greater stability. Revolution R Enterprise adds commercial enhancements and professional support for real-world use, bringing higher performance, greater scalability, and stronger reliability to R at a fraction of the cost of legacy products.

R is easily extended with libraries that are distributed in packages. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation. Once installed, a package must be loaded into the session before it can be used.
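As a concrete illustration of that install-then-load cycle, the following minimal sketch installs and loads a package from within R (the stringr package is used here only as an example; the CRAN mirror is the one used throughout this paper):

> # one-time: download and install the package from a CRAN mirror
> install.packages('stringr', repos="http://cran.revolutionanalytics.com")
> # every session: load the package before using its functions
> library(stringr)
> str_detect("mapreduce", "map")   # returns TRUE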

Special Installation Notes

These installation instructions are specific to the product versions listed earlier in this document. Some modifications may be required for your environment. A package repository must be available to install dependent packages. The MapR cluster must be installed and running either Hadoop HBase or MapR tables, and you will need root privileges on all nodes in the cluster. MapR installation instructions are available on the MapR documentation web site: http://doc.mapr.com/display/mapr/quick+installation+guide

All commands entered by the user are shown in courier font. Commands entered from within the R environment are preceded by the default R prompt '>'. Linux shell commands are preceded by either the '#' character (for the root user) or the '$' character (for a non-root user). For example, the following represents running the yum command as the root user:

# yum install git -y

The following example represents running a command within the R shell:

> library(rhdfs)

Similarly, the following represents running a command within the HBase shell:

hbase(main):001:0> create 'mytable', 'cf1'

Note that some Linux commands are long and wrap across multiple lines in this document; they use the backslash "\" character to escape the carriage return. Similarly, long R commands break after a comma "," to escape the carriage return. This facilitates copying from this document and pasting into a Linux terminal window. Here is an example of a long Linux command that wraps to multiple lines:

# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"

And here is an example of a long R command that wraps to multiple lines:

> install.packages(c('Rcpp'),
repos="http://cran.revolutionanalytics.com")

Unless otherwise indicated, all commands should be run as the root user on the client and task tracker systems. This document assumes there is a non-root user called user01 on your client system for purposes of validating the installation. You can use any non-root user that can access the MapR cluster; just remember to replace all occurrences of user01 with your own non-root user. Finally, note that these installation instructions assume your client systems are running RedHat Linux, as that is the only operating system supported by Revolution Analytics.

Install R

Testing for this paper was done with GNU R. If you have Revolution R Community or Enterprise, you can use that version of R instead; follow Revolution Analytics' installation instructions for the appropriate edition. A version of R must always be installed on the client system accessing the cluster through the RHadoop libraries. Additionally, to execute MapReduce jobs with the rmr2 library, R must be installed on all task tracker nodes in the cluster. As the root user, follow the steps below to install GNU R on all client systems and task trackers.

1) Install GNU R.

# yum -y --enablerepo=epel install R

2) Install the GNU R developer package.

# yum -y --enablerepo=epel install R-devel

Note that the R-devel package may already be up to date after installing the R package.

3) Confirm the installation was successful by running R as a non-root user on your client system and on all your task trackers. At the command line, type the following to determine the version of R that is installed.

# su - user01 -c "R --version"

Install RHadoop

The installation instructions that follow are complete for each RHadoop package (rhdfs, rmr2, rhbase), so system administrators can skip to the instructions for just the package(s) they want to install. Recall that R must be installed before installing any of the RHadoop packages.

Install rhdfs

The rhdfs package uses the hadoop command to access MapR file services. To use rhdfs, R and the rhdfs package only need to be installed on the client system that is accessing the cluster. This can be a node in the cluster or any client system that can access the cluster with the hadoop command. As the root user, perform the following steps on every client node.

1) Confirm that you can access the MapR file services by listing the contents of the top-level directory.

# su - user01 -c "hadoop fs -ls /"

2) Install the rJava R package that is required by rhdfs.

# R --save
> install.packages(c('rJava'), repos="http://cran.revolutionanalytics.com")
> quit()

3) Download the rhdfs package from github.

# yum install git -y
# cd ~
# git clone git://github.com/revolutionanalytics/rhdfs.git

4) Set the HADOOP_CMD environment variable to the hadoop command script and install the rhdfs package. Whereas the rJava package in the previous step was downloaded and installed from a CRAN repository, rhdfs is installed from the local git clone with R CMD INSTALL.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
# R CMD INSTALL ~/rhdfs/pkg

5) Set the remaining required environment variables. In addition to the HADOOP_CMD variable set in the previous step, LD_LIBRARY_PATH must include the location of the MapR client library libmaprclient.so, and HADOOP_CONF must specify the Hadoop configuration directory. Any user wanting to use the rhdfs library must set these environment variables.

# export LD_LIBRARY_PATH=/opt/mapr/lib:$LD_LIBRARY_PATH
# export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf

6) Switch user to your non-root user to validate that the installation was successful.

# su user01

7) From R, load the rhdfs library and confirm that you can access the MapR cluster file system by listing the root directory.

$ R --no-save
> library(rhdfs)
> hdfs.init()
> hdfs.ls('/')
> quit()

Note: When loading an R library using the library() command, dependent libraries are also loaded. For rhdfs, the rJava library will be loaded if it has not been already.

8) Exit from your su command.

$ exit

9) Check the installation of the rhdfs package.

# R CMD check ~/rhdfs/pkg; echo $?

An exit code of 0 means the installation was successful. If no errors are reported, you have successfully installed rhdfs and can use it to access the MapR file services from R (you can safely ignore any "notes").

10) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" \
>> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" \
>> /etc/profile
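With rhdfs installed, routine file operations can be scripted from R. The following is a minimal sketch, run as a non-root user with the environment variables above set; the /tmp/rhdfs-demo path and the local file are arbitrary illustrations, and hdfs.mkdir, hdfs.put, hdfs.ls, and hdfs.delete are rhdfs functions (check the package help for exact signatures):

$ R --no-save
> library(rhdfs)
> hdfs.init()                                 # connect using HADOOP_CMD
> hdfs.mkdir('/tmp/rhdfs-demo')               # create a directory in MapR-FS
> hdfs.put('/etc/hosts', '/tmp/rhdfs-demo')   # copy a local file into the cluster
> hdfs.ls('/tmp/rhdfs-demo')                  # list it, as 'hadoop fs -ls' would
> hdfs.delete('/tmp/rhdfs-demo')              # remove the demo directory
> quit()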

Install rmr2

The rmr2 package uses Hadoop Streaming to invoke R map and reduce functions. To use rmr2, the R and rmr2 packages must be installed on the clients as well as on every task tracker node in the MapR cluster. As the root user, install rmr2 on every task tracker node AND on every client system.

1) Install the required R packages.

# R --no-save
> install.packages(c('Rcpp'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('RJSONIO'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('itertools'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('digest'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('functional'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('stringr'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('plyr'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('bitops'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('reshape2'), repos="http://cran.revolutionanalytics.com")
> install.packages(c('caTools'), repos="http://cran.revolutionanalytics.com")
> quit()

Note: Warnings may be safely ignored while the packages are being built.

2) Download the quickcheck and rmr2 packages from github. RHadoop includes the quickcheck package to support the randomized unit tests performed by the rmr2 package check.

# cd ~
# git clone git://github.com/revolutionanalytics/quickcheck.git
# git clone git://github.com/revolutionanalytics/rmr2.git

3) Install the quickcheck and rmr2 packages.

# R CMD INSTALL ~/quickcheck/pkg
# R CMD INSTALL ~/rmr2/pkg

Note: Warnings may be safely ignored while the packages are being built.

4) From any task tracker node, create a directory called /tmp (if it doesn't already exist) in the root of your MapR-FS (owned by the mapr user) and give it global read-write permissions. This directory is required for running MapReduce applications in R using the rmr2 package.

# su - mapr -c "hadoop fs -mkdir /tmp"

Note: the command above will fail gracefully if the /tmp directory already exists.

# su - mapr -c "hadoop fs -chmod 777 /tmp"
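Before validating against the cluster, you can sanity-check an rmr2 installation on a single machine with its local backend, which runs the MapReduce logic in-process and needs no Hadoop services. A minimal sketch (rmr.options, to.dfs, mapreduce, keyval, and from.dfs are rmr2 functions; the doubling job is only an illustration):

$ R --no-save
> library(rmr2)
> rmr.options(backend = "local")   # run in-process; no cluster required
> ints <- to.dfs(1:10)             # write sample data to a temporary DFS file
> job <- mapreduce(input = ints,
    map = function(k, v) keyval(v, 2 * v))
> from.dfs(job)                    # read the key-value pairs back into R
> quit()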

Validate rmr2 on any single client system as a non-root user with the following steps. Note that the rmr2 package must be installed on ALL the task trackers before proceeding.

1) Confirm that your MapR cluster is configured to run a simple MapReduce job outside of the RHadoop environment. You only need to perform this step on one client from which you intend to run your R MapReduce programs.

# su - user01 -c "hadoop fs -mkdir /tmp/mrtest/wc-in"
# su - user01 -c "hadoop fs -put /opt/mapr/notice.txt /tmp/mrtest/wc-in"
# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"
# su - user01 -c "hadoop fs -cat /tmp/mrtest/wc-out/part-r-00000"
# su - user01 -c "hadoop fs -rmr /tmp/mrtest"

2) Copy the wordcount.r script to a directory that is accessible by your non-root user.

# cp ~/rmr2/pkg/tests/wordcount.r /tmp

3) Set the required environment variables HADOOP_CMD and HADOOP_STREAMING. Since rmr2 uses Hadoop Streaming, it needs access to both the hadoop command and the streaming jar.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
# export HADOOP_STREAMING=/opt/mapr/hadoop/hadoop-\
0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar

4) Switch user to your non-root user to validate that the installation was successful.

# su user01

5) Run the wordcount.r program from the R environment as your non-root user.

$ R --no-save < /tmp/wordcount.r; echo $?

Note that an exit code of 0 means the command was successful.

6) Exit from your su command.

$ exit

7) Run the full rmr2 check. The examples run by the rmr2 check sequentially generate 81 streaming MapReduce jobs on the cluster. Eighty of the jobs have just 2 mappers, so a large cluster will not speed this up; on a 3-node medium EC2 cluster with two task trackers, the examples take just over 1 hour. You may wish to launch the check under nohup as shown below and wait for it to complete before proceeding with this document.

# nohup R CMD check ~/rmr2/pkg > ~/rmr2-check.out &

Check the output in ~/rmr2-check.out for any errors.

8) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" \
>> /etc/profile
# echo "export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf" \
>> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" \
>> /etc/profile
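To see what an R MapReduce program looks like beyond the bundled tests, here is a hedged sketch of a wordcount written against the rmr2 API, run with HADOOP_CMD and HADOOP_STREAMING set as above (mapreduce, keyval, and the "text" input format are part of rmr2; the input path is an illustration, and this is not the bundled wordcount.r verbatim):

> library(rmr2)
> wc.map <- function(., lines) {
    keyval(unlist(strsplit(lines, split = " ")), 1)   # emit (word, 1) per token
  }
> wc.reduce <- function(word, counts) {
    keyval(word, sum(counts))                         # total count per word
  }
> out <- mapreduce(input = '/tmp/mrtest/wc-in',
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)
> from.dfs(out)                                       # inspect the results in R

Because the reduce function simply sums counts, it is associative and can double as the combiner (combine = TRUE), cutting the data shuffled between map and reduce.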

Install rhbase

The rhbase package accesses HBase via the HBase Thrift server, which is included in the MapR HBase distribution. The rhbase package is a Thrift client that sends requests to, and receives responses from, the Thrift server; the Thrift server listens for Thrift requests and in turn uses the HBase HTable java class to access HBase. Since rhbase is a client-side technology, it only needs to be installed on the client system that will access the MapR HBase cluster. Any MapR HBase cluster node can also be a client.

For the client system to access a local Thrift server, it must have the mapr-hbase-internal package installed, which includes the MapR HBase Thrift server. If your client system is one of the MapR HBase Masters or Region Servers, it will already have this package installed. These rhbase installation instructions assume that mapr-hbase-internal is already installed on the client system.

In addition to the HBase Thrift server, the rhbase package requires the Thrift include files at compile time and the C++ Thrift library at runtime in order to act as a Thrift client. Since these Thrift components are not included in the MapR distribution, Thrift must be installed before rhbase.

By default, rhbase connects to a Thrift server on the local host. A remote server can be specified in the rhbase hb.init() call, but the rhbase package check expects the Thrift server to be local. These installation instructions assume the Thrift server is running locally and that HBase is installed and running in your cluster.

As the root user, perform the following installation steps on ALL task tracker nodes and client systems.

1) Install (or update) the prerequisite packages.

# yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel \
libevent-devel zlib-devel python-devel openssl-devel ruby-devel qt \
qt-devel php-devel

2) Download, build, and install Thrift.

# cd ~
# git clone https://git-wip-us.apache.org/repos/asf/thrift.git thrift
# cd ~/thrift
# sed -i 's/2.65/2.63/' configure.ac
# ./bootstrap.sh
# ./configure
# make && make install
# /sbin/ldconfig /usr/lib/libthrift-0.9.1.so

3) Download the rhbase package.

# cd ~
# git clone git://github.com/revolutionanalytics/rhbase.git

4) Modify the thrift.pc file to append the thrift directory to the includedir configuration.

# sed -i \
's/^includedir=\${prefix}\/include/includedir=\${prefix}\/include\/thrift/' \
~/thrift/lib/cpp/thrift.pc

5) Install the rhbase package. LD_LIBRARY_PATH must be set so the Thrift library (libthrift.so) installed above can be found, and PKG_CONFIG_PATH must point to the directory containing the thrift.pc package configuration file.

# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=~/thrift/lib/cpp
# R CMD INSTALL ~/rhbase/pkg

6) Configure the file /opt/mapr/hbase/hbase-0.94.13/conf/hbase-site.xml with the HBase zookeeper servers and their port number. For HBase Master Servers and HBase Region Servers, the zookeeper servers should already be properly configured in this file. For client-only systems, edit the hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort properties to correspond to your zookeeper servers.

# su - mapr -c "maprcli node listzookeepers"
# su - mapr -c "vi /opt/mapr/hbase/hbase-0.94.13/conf/hbase-site.xml"
...
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zkhost1,zkhost2,zkhost3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
</property>
...

7) Start the MapR HBase Thrift server as a background daemon.

# /opt/mapr/hbase/hbase-0.94.13/bin/hbase-daemon.sh start thrift

Note: Pass the parameters stop thrift to the hbase-daemon.sh script to stop the daemon.

8) Run the rhbase package checks. LD_LIBRARY_PATH and PKG_CONFIG_PATH must still be set. Note that the command will produce errors you can safely ignore if you are not running Hadoop HBase in your MapR cluster; recall that the rhbase package is also compatible with MapR tables, which do not require Hadoop HBase to be installed and running.

# R CMD check ~/rhbase/pkg

Note: When running the rhbase package check, warnings can be safely ignored.

9) (optional) Persist the required environment variable settings for the shells in your environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib64/R/library/\
rhbase/libs:\$LD_LIBRARY_PATH" >> /etc/profile
# echo "export PKG_CONFIG_PATH=~root/thrift/lib/cpp" >> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" \
>> /etc/profile

Test rhbase with Hadoop HBase

The rhbase package is now installed and ready for use by any user on the system. Validate that a non-root user can access HBase via the HBase shell and from rhbase. The instructions below assume the environment variables LD_LIBRARY_PATH and HADOOP_CMD have been configured for the user01 user, and that Hadoop HBase is running in your MapR cluster.

1) Start the HBase shell, create a table, display its description, disable it, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create 'mytable', 'cf1'
hbase(main):002:0> describe 'mytable'
hbase(main):003:0> disable 'mytable'
hbase(main):004:0> drop 'mytable'
hbase(main):005:0> quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('mytable', 'cf1')
> hb.describe.table('mytable')
> hb.disable.table('mytable')
> hb.delete.table('mytable')
> q()
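Beyond creating and dropping tables, rhbase can also write and read individual cells. Here is a hedged sketch of that workflow (hb.insert and hb.get are rhbase functions, but confirm their exact argument structure against the package help; the row, column, and value are illustrations):

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('mytable', 'cf1')
> # write rowkey 'row1' with value 'hello' in column 'cf1:col1'
> hb.insert('mytable', list(list('row1', c('cf1:col1'), list('hello'))))
> hb.get('mytable', 'row1')        # read the row back
> hb.disable.table('mytable')
> hb.delete.table('mytable')
> q()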

Test rhbase with MapR Tables

The rhbase package is also compatible with MapR tables, which you can use if you have an M7 license installed on your MapR cluster. Simply use absolute paths in MapR-FS for your table names rather than the relative names used with HBase. The following steps assume that the user01 home directory is in a MapR file system at /mapr/mycluster/home/user01.

1) Start the HBase shell, create a table, display its description, disable it, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create '/mapr/mycluster/home/user01/mytable', 'cf1'
hbase(main):002:0> describe '/mapr/mycluster/home/user01/mytable'
hbase(main):003:0> disable '/mapr/mycluster/home/user01/mytable'
hbase(main):004:0> drop '/mapr/mycluster/home/user01/mytable'
hbase(main):005:0> quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('/mapr/mycluster/home/user01/mytable', 'cf1')
> hb.describe.table('/mapr/mycluster/home/user01/mytable')
> hb.delete.table('/mapr/mycluster/home/user01/mytable')
> q()
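As noted in the rhbase installation section, hb.init() connects to a Thrift server on the local host by default. If your Thrift server runs on another node, the host and port can be passed explicitly. A sketch, where the hostname is an illustration and 9090 is the conventional HBase Thrift port (confirm the argument names against the rhbase documentation):

> library(rhbase)
> hb.init(host = 'thriftnode.example.com', port = 9090)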

Conclusion

With the Revolution Analytics RHadoop packages and MapR's enterprise-grade Hadoop distribution, data scientists can utilize the full potential of Hadoop from the familiar R environment.

Resources

More information on RHadoop and the other technologies referenced in this paper can be found at the links below.

GNU R home page: http://www.r-project.org
RHadoop home page: https://github.com/revolutionanalytics/
Apache Thrift: http://thrift.apache.org
Revolution Analytics: http://www.revolutionanalytics.com
MapR Technologies: http://www.mapr.com

About MapR Technologies

MapR's advanced distribution for Apache Hadoop delivers on the promise of Hadoop, making the management and analysis of big data a practical reality for more organizations. MapR's advanced capabilities, such as streaming analytics, mission-critical data protection, and MapR tables, expand the breadth and depth of use cases across industries.

About Revolution Analytics

Revolution Analytics (formerly Revolution Computing) was founded in 2007 to foster the R community, as well as support the growing needs of commercial users. Our name derives from combining the letter "R" with the word "evolution." It speaks to the ongoing development of the R language from an open-source academic research tool into commercial applications for industrial use.