Technical Note: Configure SparkR to use TIBCO Enterprise Runtime for R
Software Release 3.2
May 2015
The SparkR package is an R package that provides a front end for using the Apache Spark system for distributed computation. SparkR allows using R to invoke Spark jobs, which can then call R to perform computations on distributed worker nodes. You can modify the SparkR source to call the TIBCO Enterprise Runtime for R (TERR) engine rather than the R engine by following the instructions in this technical note.

To use TERR with SparkR, you must be able to perform the following tasks.

- Install Hadoop (with YARN) and Spark.
- Install open-source R.
- Download the SparkR package source.
- Modify and build the SparkR package.
- Install TERR, version 3.2 or later, and link it to your modified SparkR package.

Installing and downloading the required components

Before you configure TIBCO Enterprise Runtime for R to work with SparkR, you must download the sources for the SparkR package, and you must install Hadoop and Spark. We have tested the configuration on the versions listed in these instructions.

Perform this task from a browser on a computer that meets the requirements for running Hadoop with YARN, Spark, and TERR.

- You must have installed open-source R.
- You must have installed TIBCO Enterprise Runtime for R.
- You must be able to install Hadoop and Spark.
- You must be able to download the SparkR sources.

1. Install Hadoop 2.6.0 (with YARN).
   a) Browse to hadoop.apache.org.
   b) Follow the instructions to install Hadoop with YARN.
   We have tested this configuration with Hadoop 2.6.0.
2. Install Spark 1.3.0.
   a) Browse to spark.apache.org.
   b) Follow the instructions to install Spark.
   We have tested this configuration with Spark 1.3.0.
3. Download the sources for SparkR.
   a) Browse to https://github.com/amplab-extras/SparkR-pkg/.
   b) Download the sources using git (a sketch of these commands, driven from R, follows these steps). For our test, we pulled the sources for the master branch, with the last change as follows:

      commit 2167eec8187e3a10b08e3328ed6c2b5fc449edde
      Merge: a5eb4fd 1d6ff10
      Author: Zongheng Yang <zongheng.y@gmail.com>
      Date:   Tue Apr 7 23:14:41 2015 -0700

          Merge pull request #244 from sun-rui/SPARKR-154_5

          [SPARKR-154] Phase 4: implement subtract() and subtractByKey().
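The following is a minimal sketch of step 3b, driven from an R session with system(). It assumes the git command-line client is installed and on the PATH; the /home/git destination is just the directory used later in this note, so substitute your own.

   # Fetch the SparkR sources (assumes git is installed and /home/git is writable).
   system("git clone https://github.com/amplab-extras/SparkR-pkg.git /home/git/SparkR-pkg")
   # Pin the working tree to the commit we tested.
   system("git -C /home/git/SparkR-pkg checkout 2167eec8187e3a10b08e3328ed6c2b5fc449edde")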
Modifying the SparkR sources to use a specified engine

Modify the SparkR sources so that they do not specify a hard-wired command to Rscript. The SparkR package we tested assumes that worker nodes can call the R engine using the hard-wired command Rscript. To use SparkR with TIBCO Enterprise Runtime for R (TERR), we needed to change the SparkR sources so that they can call a different engine instead.

Perform this task using a code editor on a computer that meets the prerequisites.

- You must have installed open-source R.
- You must have installed TIBCO Enterprise Runtime for R.
- You must have completed the steps described in Installing and downloading the required components.

Change the SparkR source code to allow a different command to invoke the engine by making the following change to the file SparkR-pkg/pkg/src/src/main/scala/edu/berkeley/cs/amplab/sparkr/RRDD.scala.

Change

   private def createRProcess(rLibDir: String, port: Int, script: String) = {
     val rCommand = "Rscript"
     val rOptions = "--vanilla"
     ...

to

   private def createRProcess(rLibDir: String, port: Int, script: String) = {
     val rCommand = SparkEnv.get.conf.get("spark.sparkr.r.command", "Rscript")
     val rOptions = "--vanilla"
     ...

At some point, we hope to contribute this change to the SparkR sources.

Building, configuring, and testing SparkR with TIBCO Enterprise Runtime for R

Build, configure, and test SparkR with TERR. After you have changed the SparkR source to allow for another engine, you must build the package, configure it to use the TIBCO Enterprise Runtime for R (TERR) engine, and then test the results.

Perform this task on a computer that meets all of the prerequisites described in Modifying the SparkR sources to use a specified engine.

- You must have modified the SparkR sources to specify the engine.
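Before running the build in step 1 below, you can confirm the build toolchain from an R session. This is a minimal sketch, assuming scala and sbt are already on the PATH; the version strings we built with appear in step 1.

   # Print the toolchain versions to the console.
   system("scala -version")   # we built with Scala code runner 2.11.6
   system("sbt sbtVersion")   # we built with sbt launcher 0.13.8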
1. Build the SparkR package.
   For example, from the command line, issue the following commands.

      cd /home/git/SparkR-pkg
      USE_YARN=1 SPARK_VERSION=1.3.0 SPARK_YARN_VERSION=2.6.0 SPARK_HADOOP_VERSION=2.6.0 ./install-dev.sh

   We built the package using the following versions of Scala and sbt:

      Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
      sbt launcher version 0.13.8

2. Allow TERR to access the SparkR package.
   For example, if you have TERR (version 3.2 or later) installed in /home/terr, add a link from the TERR library to the SparkR library just built:

      ln -s /home/git/SparkR-pkg/lib/SparkR /home/terr/library/SparkR

3. Test TERR with SparkR.

      /home/terr/bin/TERR --no-restore --no-save

      library(SparkR)
      sc <- sparkR.init(master="local",
          sparkEnvir=list(spark.ui.showConsoleProgress="false",
                          spark.sparkr.use.daemon="false",
                          spark.sparkr.r.command="/home/terr/bin/TERRscript"))
      reduce(parallelize(sc, 1:10), "+")   # test: should return 55

   The parameter spark.sparkr.r.command specifies the command to be used, in place of Rscript, when invoking the engine from worker nodes. Here, it is given the path to the TERRscript command.

   We specify spark.sparkr.use.daemon="false" so that SparkR does not create a daemon R process to spawn R engines. The SparkR code for this daemon uses several functions that are not currently implemented in TERR (for example, parallel:::mcfork, parallel:::mcexit, tools::pskill, and socketSelect). Eventually, we want to support these functions in TERR. A quick check for these functions is sketched in the troubleshooting topic below.

   The parameter spark.ui.showConsoleProgress="false" is not required for using TERR, but it is useful: it turns off the progress bar printed to the console during Spark operations.

If you have problems using TERR with SparkR, try configuring SparkR to use open-source R. See Troubleshooting your SparkR configuration.

Troubleshooting your SparkR configuration

If you experience problems using SparkR with TIBCO Enterprise Runtime for R, try testing SparkR with open-source R.

- You must be able to modify and build the SparkR source.
- You must have installed open-source R.

1. Make the SparkR package available to open-source R.
2. Start open-source R.
3. In the open-source R console, load the SparkR package.
4. Run a test script and evaluate the results.
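Before switching engines, a quick first check is whether your TERR engine provides the internals that the SparkR daemon path needs (the functions listed in the previous topic). This is a minimal sketch to run in a plain TERR (or R) session; no Spark context is required, and has_fun is a small helper defined here only for illustration.

   # Check for the daemon-related internals in the running engine.
   has_fun <- function(pkg, name) {
     requireNamespace(pkg, quietly = TRUE) &&
       exists(name, envir = asNamespace(pkg), inherits = FALSE)
   }
   c(mcfork       = has_fun("parallel", "mcfork"),
     mcexit       = has_fun("parallel", "mcexit"),
     pskill       = has_fun("tools", "pskill"),
     socketSelect = exists("socketSelect"))   # socketSelect lives in base R
   # Any FALSE result means you should keep spark.sparkr.use.daemon="false".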
Example: Testing open-source R with SparkR

   ## make the SparkR package available to R
   ln -s /home/git/SparkR-pkg/lib/SparkR /home/R/library/SparkR

   ## run R, load SparkR, run the test
   /home/R/bin/R --no-restore --no-save

   library(SparkR)
   sc <- sparkR.init(master="local")
   reduce(parallelize(sc, 1:10), "+")   # test: should return 55
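With either engine, you can also confirm which engine the worker processes actually ran. This is a minimal sketch, assuming the sc created in the example above; it relies on R.version.string, which identifies open-source R (TERR reports its own version string), and on sparkR.stop() from the SparkR package to shut down the context when you are done.

   # Report the engine version string from the workers.
   versions <- collect(lapply(parallelize(sc, 1:2), function(x) R.version.string))
   unique(unlist(versions))

   # When finished, shut down the Spark context and its worker processes.
   sparkR.stop()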