Technical Note: Configure SparkR to use TIBCO Enterprise Runtime for R



Software Release 3.2
May 2015

Configure SparkR to use TIBCO Enterprise Runtime for R

The SparkR package is an R package that provides a front end for using the Apache Spark system for distributed computation. SparkR allows using R to invoke Spark jobs, which can then call R to perform computations on distributed worker nodes. You can modify the SparkR source to call the TIBCO Enterprise Runtime for R (TERR) engine rather than the R engine by following the instructions in this technical note.

To use TERR with SparkR, you must be able to perform the following tasks:

- Install Hadoop (with YARN) and Spark.
- Install open-source R.
- Download the SparkR package source.
- Modify and build the SparkR package.
- Install TERR, version 3.2 or later, and link it to your modified SparkR package.

Installing and downloading the required components

Before you configure TIBCO Enterprise Runtime for R to work with SparkR, you must download the sources for the SparkR package, and you must install Hadoop and Spark. We have tested the configuration on the versions listed in these instructions.

Perform this task from a browser on a computer that meets the requirements for running Hadoop with YARN, Spark, and TERR.

Prerequisites:
- You must have installed open-source R.
- You must have installed TIBCO Enterprise Runtime for R.
- You must be able to install Hadoop and Spark.
- You must be able to download the SparkR sources.

1. Install Hadoop 2.6.0 (with YARN).
   a) Browse to hadoop.apache.org.
   b) Follow the instructions to install Hadoop with YARN. We have tested this configuration with Hadoop 2.6.0.
2. Install Spark 1.3.0.
   a) Browse to spark.apache.org.
   b) Follow the instructions to install Spark. We have tested this configuration with Spark 1.3.0.
3. Download the sources for SparkR.
   a) Browse to https://github.com/amplab-extras/SparkR-pkg/.
   b) Download the sources using git.
For our test, we pulled the sources for the master branch, with the last change as follows:

    commit 2167eec8187e3a10b08e3328ed6c2b5fc449edde
    Merge: a5eb4fd 1d6ff10
    Author: Zongheng Yang <zongheng.y@gmail.com>
    Date:   Tue Apr 7 23:14:41 2015 -0700

        Merge pull request #244 from sun-rui/sparkr-154_5

        [SPARKR-154] Phase 4: implement subtract() and subtractByKey().
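The download step can be scripted. The following is a minimal sketch; the working directory /home/git is an assumption chosen to match the paths used later in this note, and the commit hash is the revision we tested:

```shell
# Sketch: fetch the SparkR package sources and pin the tested revision.
# /home/git is an assumed working directory; adjust to your environment.
mkdir -p /home/git
cd /home/git
git clone https://github.com/amplab-extras/SparkR-pkg.git
cd SparkR-pkg

# Check out the exact commit described above (leaves the repo in a
# detached-HEAD state, which is fine for building).
git checkout 2167eec8187e3a10b08e3328ed6c2b5fc449edde
```

Checking out the pinned commit guards against later upstream changes altering the source layout that the edits in this note assume.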

Next, modify the SparkR sources so they do not specify a hard-wired command to Rscript.

Modifying the SparkR sources to use a specified engine

The SparkR package we tested assumes that worker nodes can call the R engine using the hard-wired command Rscript. To use SparkR with TIBCO Enterprise Runtime for R (TERR), we needed to change the SparkR sources so that they can call a different engine instead.

Perform this task using a code editor on a computer that meets the prerequisites.

Prerequisites:
- You must have installed open-source R.
- You must have installed TIBCO Enterprise Runtime for R.
- You must have completed the steps described in Installing and downloading the required components.

Change the SparkR source code to allow a different command to invoke the engine by making the following change to the file SparkR-pkg/pkg/src/src/main/scala/edu/berkeley/cs/amplab/sparkr/RRDD.scala.

Change

    private def createRProcess(rLibDir: String, port: Int, script: String) = {
      val rCommand = "Rscript"
      val rOptions = "--vanilla"
      ...

to

    private def createRProcess(rLibDir: String, port: Int, script: String) = {
      val rCommand = SparkEnv.get.conf.get("spark.sparkr.r.command", "Rscript")
      val rOptions = "--vanilla"
      ...

At some point, we hope that we can contribute this change to the SparkR sources. Next, build, configure, and test SparkR with TERR.

Building, configuring, and testing SparkR with TIBCO Enterprise Runtime for R

After you have changed the SparkR source to allow for another engine, you must build the package, configure it to use the TIBCO Enterprise Runtime for R (TERR) engine, and then test the results.

Perform this task on a computer that meets all of the prerequisites described in Modifying the SparkR sources. You must have modified the SparkR sources to specify the engine.
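If you prefer to script the one-line source change rather than edit the file by hand, a sed substitution such as the following can apply it. This is a sketch; the file path follows the checkout layout described earlier, and you should review the result before building:

```shell
# Sketch: replace the hard-wired "Rscript" command in RRDD.scala with a
# lookup of the spark.sparkr.r.command configuration property.
# The path assumes the SparkR-pkg checkout described earlier in this note.
SRC="SparkR-pkg/pkg/src/src/main/scala/edu/berkeley/cs/amplab/sparkr/RRDD.scala"
if [ -f "$SRC" ]; then
  sed -i 's|val rCommand = "Rscript"|val rCommand = SparkEnv.get.conf.get("spark.sparkr.r.command", "Rscript")|' "$SRC"
  grep -n 'spark.sparkr.r.command' "$SRC"   # confirm the substitution landed
fi
```

Using | as the sed delimiter avoids escaping the quotation marks and dots inside the replacement text.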

1. Build the SparkR package. For example, from the command line, issue the following commands:

       cd /home/git/SparkR-pkg
       USE_YARN=1 SPARK_VERSION=1.3.0 SPARK_YARN_VERSION=2.6.0 SPARK_HADOOP_VERSION=2.6.0 ./install-dev.sh

   We built the package using the following versions of Scala and sbt:

       Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
       sbt launcher version 0.13.8

2. Allow TERR to access the SparkR package. For example, if you have TERR (version 3.2 or later) installed in /home/terr, add a link from the TERR library to the SparkR library you just built:

       ln -s /home/git/SparkR-pkg/lib/SparkR /home/terr/library/SparkR

3. Test TERR with SparkR:

       /home/terr/bin/TERR --no-restore --no-save

       library(SparkR)
       sc <- sparkR.init(master="local",
         sparkEnvir=list(spark.ui.showConsoleProgress="false",
                         spark.sparkr.use.daemon="false",
                         spark.sparkr.r.command="/home/terr/bin/TERRscript"))
       reduce(parallelize(sc, 1:10), "+")   # test: should return 55

The parameter spark.sparkr.r.command specifies the command to be used, in place of Rscript, when invoking the engine from worker nodes. Here, it is given the path to the TERRscript command.

We specify spark.sparkr.use.daemon="false" so that SparkR does not create a daemon R process to spawn R engines. The SparkR code for this daemon uses several functions that are not currently implemented in TERR (for example, parallel:::mcfork, parallel:::mcexit, tools::pskill, and socketSelect). Eventually, we want to support these in TERR.

The parameter spark.ui.showConsoleProgress="false" is not required for using TERR, but it is useful: it turns off the progress bar printed to the console during Spark operations.

If you have problems using TERR with SparkR, try configuring SparkR to use open-source R. See Troubleshooting your SparkR configuration.

Troubleshooting your SparkR configuration

If you experience problems using SparkR with TIBCO Enterprise Runtime for R, try testing SparkR with open-source R.
Prerequisites:
- You must be able to modify and build the SparkR source.
- You must have installed open-source R.

1. Make the SparkR package available to open-source R.
2. Start open-source R.
3. In the open-source R console, load the SparkR package.
4. Run a test script and evaluate the results.

Example: Testing open-source R with SparkR

    ## make the SparkR package available to R
    ln -s /home/git/SparkR-pkg/lib/SparkR /home/R/library/SparkR

    ## run R, load SparkR, run test
    /home/R/bin/R --no-restore --no-save

    library(SparkR)
    sc <- sparkR.init(master="local")
    reduce(parallelize(sc, 1:10), "+")   # test: should return 55
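The test above can also be run non-interactively, which is convenient for repeating it after configuration changes. The following is a sketch only: the R install prefix /home/R is an assumption matching the example paths in this note, and it requires the SparkR link from the example above to be in place:

```shell
# Sketch: run the open-source-R SparkR smoke test as a single command,
# feeding the R session from a heredoc instead of typing interactively.
# /home/R is an assumed R install prefix; adjust to your environment.
/home/R/bin/R --no-restore --no-save <<'EOF'
library(SparkR)
sc <- sparkR.init(master="local")
# Sum 1..10 on the local Spark instance; 55 indicates a working setup.
print(reduce(parallelize(sc, 1:10), "+"))
EOF
```

The quoted heredoc delimiter ('EOF') prevents the shell from expanding anything inside the R code before R sees it.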