RevoScaleR 7.3 HPC Server Getting Started Guide




The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2014. RevoScaleR 7.3 HPC Server Getting Started Guide. Revolution Analytics, Inc., Mountain View, CA.

Copyright 2014 Revolution Analytics, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of Revolution Analytics.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013.

Revolution R, Revolution R Enterprise, RPE, RevoScaleR, DeployR, RevoTreeView, and Revolution Analytics are trademarks of Revolution Analytics. Revolution R includes the Intel Math Kernel Library (https://software.intel.com/en-us/intel-mkl). RevoScaleR includes Stat/Transfer software under license from Circle Systems, Inc. Stat/Transfer is a trademark of Circle Systems, Inc. Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective owners.

Revolution Analytics, 2570 W. El Camino Real, Suite 222, Mountain View, CA 94040 USA.

Revised on December 9, 2014.

We want our documentation to be useful, and we want it to address your needs. If you have comments on this or any Revolution document, write to doc@revolutionanalytics.com.

Contents

1 Overview
2 Running the Examples in This Guide
3 A Tutorial Introduction to RevoScaleR on HPC Server
  3.1 Creating a Compute Context for Your HPC Server Cluster
  3.2 Creating a Data Source
  3.3 Summarizing Your Data
  3.4 Fitting a Linear Model
  3.5 Creating a Non-Waiting Compute Context
4 Analyzing a Large Data Set with RevoScaleR
  4.1 Estimating a Linear Model
  4.2 Handling Larger Linear Models
  4.3 A Large Data Logistic Regression
5 Extending Your HPC Server Compute Context
  5.1 Using the Head Node for Computation
  5.2 Specifying a Data Path
  5.3 Managing Compute Resources
  5.4 Obtaining Cluster Information
  5.5 Running a Simple Test of Your Compute Context
6 Copying Data with ClusterCopy
7 Continuing with Distributed Computing

1 Overview

This guide is an introduction to using the RevoScaleR package in a Microsoft HPC Server distributed computing environment. RevoScaleR provides functions for performing scalable, high-performance data management, analysis, and visualization; Microsoft HPC Server provides node management and job scheduling for distributed computation. This guide focuses on using RevoScaleR's high performance analytics (HPA) big data capabilities in the HPC Server environment.

A Microsoft HPC Server cluster consists of a head node and one or more compute nodes. Workstations belonging to the same Active Directory domain as the HPC Server cluster can be used as client workstations and can submit jobs to be run on the cluster.

The data manipulation and analysis functions in RevoScaleR are appropriate for both small and large datasets, but are particularly useful in three common situations:

1. Analyzing data sets that are too big to fit in memory.
2. Distributing computations over several cores, processors, or nodes in a cluster.
3. Creating scalable data analysis routines that can be developed locally with smaller data sets, then deployed to larger data and/or a cluster of computers.

These are ideal candidates for RevoScaleR because RevoScaleR is based on the concept of operating on chunks of data and using updating algorithms.

The RevoScaleR high performance analysis functions are portable: the same functions work on a variety of computing platforms, including Windows and Linux workstations and servers, and on distributed computing platforms such as Microsoft HPC Server, IBM Platform LSF, and Hadoop. So, for example, you can do exploratory analysis on your laptop, then deploy the same analytics code on a Microsoft HPC Server cluster. The underlying RevoScaleR code handles the distribution of the computations across cores and nodes, so you don't have to worry about it.
For those interested in the underlying architecture: on Microsoft HPC Server, the RevoScaleR analysis functions go through the following steps. A master process is initiated to run the main thread of the algorithm. The master process creates worker tasks, which are distributed by the Job Scheduler among the available computing resources. The master process gathers the results from the worker tasks, and when all worker tasks are completed, the final results are computed and returned.

Additional examples of using RevoScaleR can be found in the following manuals provided with RevoScaleR:

- RevoScaleR Getting Started Guide (RevoScaleR_Getting_Started.pdf)
- RevoScaleR User's Guide (RevoScaleR_Users_Guide.pdf)
- RevoScaleR Distributed Computing Guide (RevoScaleR_Distributed_Computing.pdf; see this guide for HPC examples)
- RevoScaleR ODBC Data Import Guide (RevoScaleR_ODBC.pdf)
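The develop-locally, deploy-to-cluster pattern described above can be sketched as follows. This fragment is illustrative only and is not part of the original examples: it assumes a data source airDS and a cluster compute context myHpcCluster, both of which are constructed step by step in Section 3, and it uses the RxLocalSeq local compute context constructor for the workstation run.

```r
# Hedged sketch of RevoScaleR's portability: the same analysis call
# runs unchanged under a local or a distributed compute context.
# (airDS and myHpcCluster are assumed; see Sections 3.1-3.2.)
rxSetComputeContext(RxLocalSeq())   # exploratory run on the workstation
localFit <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)

rxSetComputeContext(myHpcCluster)   # identical analysis call, now on the cluster
clusterFit <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
```

The point of the sketch is that only the compute context changes; the modeling code itself is untouched.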

2 Running the Examples in This Guide

The big data set used in this guide is the 2012 Airline On-Time Data, an .xdf file containing flight arrival and departure details for all commercial flights within the USA for the year 2012. This is a big data set with over six million observations, and it is used in the examples in Section 4. The initial examples use the AirlineDemoSmall.xdf file from the RevoScaleR SampleData directory.

3 A Tutorial Introduction to RevoScaleR on HPC Server

This section contains a detailed introduction to the most important high performance analytics features of RevoScaleR using your Microsoft HPC Server cluster. The following tasks are performed:

1. Specify the head node, shared directory, and R path for your cluster.
2. Create a compute context for HPC Server.
3. Create a data source.
4. Summarize your data.
5. Fit a linear model to the data.

3.1 Creating a Compute Context for Your HPC Server Cluster

A compute context object, or more briefly a compute context, is the key to distributed computing with RevoScaleR. The compute context specifies the computing resources to be used by RevoScaleR's distributable computing functions. To create a compute context object for use with a Microsoft Windows HPC Server cluster, you use the RxHpcServer function. (In general, we call such a function a compute context constructor.) Before calling this function, however, you need three pieces of information that you can obtain from your system administrator:

- headNode: What is the name of the cluster's head node, for example, cluster-head?
- revoPath: Where is Revolution R installed on the nodes? That is, what is the path to the R bin\x64 directory?
- shareDir: What is the name of the user share directory set up for your use? This directory is used to manage your distributed computing jobs; job information and intermediate results are written here. (This is a subdirectory of the network share, and you should have read and write access to it.)

You may also need to obtain from your system administrator the locations of data directories on your HPC Server cluster, which can be used to specify a default

dataPath for your compute context, and writable data directories, which can be used to specify a default outDataPath for your compute context.

myHeadNode <- "cluster-headnode"
myRevoPath <- "C:/Revolution/R-Enterprise-Node-7.3/R-3.1.1/bin/x64"
myShareDir <- "\\\\allshare\\user"
myDataPath <- "C:\\data"

A sixth piece of information, the workingDir, or default working directory, is normally the path stored in the Windows USERPROFILE environment variable, typically something like C:\\Users\\YourName. You should have read and write access to this directory. You don't normally need to specify this.

Warning: You will normally be prompted for login credentials the first time you attempt to use the HPC Server cluster (these credentials are then cached by Windows HPC Server if you select the checkbox labeled "Remember my credentials"). However, the prompt dialog may be hidden behind the Revolution R Enterprise (RPE) window, making it appear that the RPE has become frozen. You can normally view the dialog by right-clicking the Taskbar and clicking "Show the desktop" (or, equivalently, pressing the Windows logo key and M simultaneously).

We then use this information to define a compute context for HPC Server:

myHpcCluster <- RxHpcServer(
    headNode = myHeadNode,
    shareDir = myShareDir,
    revoPath = myRevoPath)

You can have many compute context objects available for use; only one is active at any one time. You specify the active compute context using the rxSetComputeContext function:

rxSetComputeContext(myHpcCluster)

3.2 Creating a Data Source

For our first explorations, we will work with one of RevoScaleR's built-in data sets, AirlineDemoSmall.xdf. You can verify that it is installed as follows:

adsFile <- system.file("SampleData/AirlineDemoSmall.xdf",
    package = "RevoScaleR")
file.exists(adsFile)

Data sources provide a convenient way to encapsulate information about the data to be used for analysis.
They can include information about specific variables or about the file system on which the data can be found, but in this case we only need to specify the file name:

airDS <- RxXdfData(file = adsFile)
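Before summarizing, it can be useful to check what a data source contains. The following hedged one-liner uses RevoScaleR's rxGetInfo function on the airDS object defined above; the particular arguments shown are just one reasonable choice, not part of the original tutorial.

```r
# Sketch: print file metadata, per-variable information, and the
# first few rows of the data source (requires RevoScaleR and airDS)
rxGetInfo(airDS, getVarInfo = TRUE, numRows = 5)
```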

3.3 Summarizing Your Data

Use the rxSummary function to obtain descriptive statistics for your data. The rxSummary function takes a formula as its first argument and the name of the data set as the second:

adsSummary <- rxSummary(~ArrDelay + CRSDepTime + DayOfWeek,
    data = airDS)
adsSummary

Summary statistics will be computed on the variables in the formula, removing missing values for all rows in the included variables:

Call:
rxSummary(formula = ~ArrDelay + CRSDepTime + DayOfWeek, data = airDS)

Summary Statistics Results for: ~ArrDelay + CRSDepTime + DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall
Number of valid observations: 6e+05

 Name       Mean     StdDev    Min        Max        ValidObs MissingObs
 ArrDelay   11.31794 40.688536 -86.000000 1490.00000 582628   17372
 CRSDepTime 13.48227  4.697566   0.016667   23.98333 600000       0

Category Counts for DayOfWeek
Number of categories: 7
Number of valid observations: 6e+05
Number of missing observations: 0

 DayOfWeek Counts
 Monday    97975
 Tuesday   77725
 Wednesday 78875
 Thursday  81304
 Friday    82987
 Saturday  86159
 Sunday    94975

Notice that the summary information shows cell counts for categorical variables, and appropriately does not provide summary statistics such as Mean and StdDev. Also notice that the Call: line shows the actual call you entered or the call provided by rxSummary, so it will appear differently in different circumstances.

You can also compute summary information by one or more categories by using interactions of a numeric variable with a factor variable. For example, to compute summary statistics on Arrival Delay by Day of Week:

rxSummary(~ArrDelay:DayOfWeek, data = airDS)

The output shows the summary statistics for ArrDelay for each day of the week:

Call:

rxSummary(formula = ~ArrDelay:DayOfWeek, data = airDS)

Summary Statistics Results for: ~ArrDelay:DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall
Number of valid observations: 6e+05

 Name               Mean     StdDev   Min Max  ValidObs MissingObs
 ArrDelay:DayOfWeek 11.31794 40.68854 -86 1490 582628   17372

Statistics by category (7 categories):

 Category                         DayOfWeek Means      StdDev   Min Max  ValidObs
 ArrDelay for DayOfWeek=Monday    Monday    12.025604  40.02463 -76 1017 95298
 ArrDelay for DayOfWeek=Tuesday   Tuesday   11.293808  43.66269 -70 1143 74011
 ArrDelay for DayOfWeek=Wednesday Wednesday 10.156539  39.58803 -81 1166 76786
 ArrDelay for DayOfWeek=Thursday  Thursday   8.658007  36.74724 -58 1053 79145
 ArrDelay for DayOfWeek=Friday    Friday    14.804335  41.79260 -78 1490 80142
 ArrDelay for DayOfWeek=Saturday  Saturday  11.875326  45.24540 -73 1370 83851
 ArrDelay for DayOfWeek=Sunday    Sunday    10.331806  37.33348 -86 1202 93395

3.4 Fitting a Linear Model

Use the rxLinMod function to fit a linear model using your airDS data source. Use a single independent variable, the factor DayOfWeek:

arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
summary(arrDelayLm1)

The resulting output is:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airDS)

Linear Regression Results for: ArrDelay ~ DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall
Dependent variable(s): ArrDelay
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 582628
Number of missing observations: 17372

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          10.3318     0.1330  77.673 2.22e-16 ***
DayOfWeek=Monday      1.6938     0.1872   9.049 2.22e-16 ***
DayOfWeek=Tuesday     0.9620     0.2001   4.809 1.52e-06 ***
DayOfWeek=Wednesday  -0.1753     0.1980  -0.885    0.376
DayOfWeek=Thursday   -1.6738     0.1964  -8.522 2.22e-16 ***
DayOfWeek=Friday      4.4725     0.1957  22.850 2.22e-16 ***
DayOfWeek=Saturday    1.5435     0.1934   7.981 2.22e-16 ***
DayOfWeek=Sunday     Dropped    Dropped  Dropped  Dropped
---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.65 on 582621 degrees of freedom
Multiple R-squared: 0.001869
Adjusted R-squared: 0.001858
F-statistic: 181.8 on 6 and 582621 DF, p-value: < 2.2e-16
Condition number: 10.5595

3.5 Creating a Non-Waiting Compute Context

When running longer analyses on your HPC Server cluster, it can be convenient to let the job run while control of your R session is returned to you immediately. You can do this by specifying wait = FALSE in your compute context definition. By using our existing compute context as the first argument, all of the other settings are carried over to the new compute context:

myHpcNoWaitCluster <- RxHpcServer(myHpcCluster, wait = FALSE)
rxSetComputeContext(myHpcNoWaitCluster)

Once you have set your compute context to non-waiting, distributed RevoScaleR functions return quickly with a jobInfo object, which you can use to track the progress of your job and, when the job is complete, obtain its results. For example, we can re-run our linear model in the non-waiting case as follows:

job1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
rxGetJobStatus(job1)

Right after submission, the job status will typically be "running". When the job status returns "finished", you can obtain the results using rxGetJobResults:

arrDelayLm1 <- rxGetJobResults(job1)
summary(arrDelayLm1)

You should always assign the jobInfo object so that you can easily track your work, but if you forget, the most recent jobInfo object is saved in the global environment as the object rxgLastPendingJob.

If, after submitting a job in a non-waiting compute context, you decide you don't want to complete the job, you can cancel it using the rxCancelJob function:

job2 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
rxCancelJob(job2)

The rxCancelJob function returns TRUE if the job is successfully cancelled, FALSE otherwise.
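For long-running non-waiting jobs, a simple polling loop is one way to wait for results programmatically. This is an illustrative sketch only, not part of the original tutorial; it assumes the job1 object created above and a configured cluster, and the five-second interval is arbitrary.

```r
# Hedged sketch: poll a non-waiting job until it leaves the "running"
# state, then fetch its results (assumes job1 as defined above)
while (rxGetJobStatus(job1) == "running") {
    Sys.sleep(5)  # wait five seconds between status checks
}
if (rxGetJobStatus(job1) == "finished") {
    arrDelayLm1 <- rxGetJobResults(job1)
}
```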
For the remainder of this tutorial, we return to a waiting compute context:

rxSetComputeContext(myHpcCluster)

4 Analyzing a Large Data Set with RevoScaleR

We will now move to examples using a more recent version of the airline data set. The airline on-time data has been gathered by the Department of Transportation since 1987. The data through 2008 was used in the American Statistical Association Data Expo in 2009 (http://stat-computing.org/dataexpo/2009/). The ASA describes the data set as follows:

The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and it takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.

The airline on-time data set for 2012 is available as an .xdf file, AirOnTime2012.xdf, for download from the Revolution Analytics web site. Assuming you have copied this file into a directory in the data path on all the nodes of your cluster, we define our RxXdfData object as follows:

airXdfName <- "AirOnTime2012.xdf"
airData <- RxXdfData(airXdfName)

4.1 Estimating a Linear Model

We fit our first model to the large airline data much as we created the linear model for the AirlineDemoSmall data, and we time the computation to see how long it takes to fit the model on this large data set:

system.time(
    delayArr <- rxLinMod(ArrDelay ~ DayOfWeek, data = airData,
        cube = TRUE)
)

To see a summary of your results:

summary(delayArr)

You should see the following results:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airData, cube = TRUE)

Cube Linear Regression Results for: ArrDelay ~ DayOfWeek
Data: airData (RxXdfData Data Source)
File name: /var/revoshare/v7alpha/AirOnTime2012
Dependent variable(s): ArrDelay
Total independent variables: 7
Number of valid observations: 6005381
Number of missing observations: 91381

Coefficients:
               Estimate Std. Error t value Pr(>|t|)     Counts
DayOfWeek=Mon   3.54210    0.03736   94.80 2.22e-16 *** 901592
DayOfWeek=Tues  1.80696    0.03835   47.12 2.22e-16 *** 855805
DayOfWeek=Wed   2.19424    0.03807   57.64 2.22e-16 *** 868505
DayOfWeek=Thur  4.65502    0.03757  123.90 2.22e-16 *** 891674
DayOfWeek=Fri   5.64402    0.03747  150.62 2.22e-16 *** 896495
DayOfWeek=Sat   0.91008    0.04144   21.96 2.22e-16 *** 732944
DayOfWeek=Sun   2.82780    0.03829   73.84 2.22e-16 *** 858366
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.48 on 6005374 degrees of freedom
Multiple R-squared: 0.001827 (as if intercept included)
Adjusted R-squared: 0.001826
F-statistic: 1832 on 6 and 6005374 DF, p-value: < 2.2e-16
Condition number: 1

4.2 Handling Larger Linear Models

The data set contains a variable UniqueCarrier, which contains airline codes for 15 carriers. Because the Revolution Compute Engine handles factor variables so efficiently, we can do a linear regression looking at Arrival Delay by Carrier. This will take a little longer, of course, than the previous analysis, because we are estimating 15 parameters instead of 7:

delayCarrier <- rxLinMod(ArrDelay ~ UniqueCarrier, data = airData,
    cube = TRUE)

Next, sort the coefficient vector so that we can see the airlines with the highest and lowest values:

dcCoef <- sort(coef(delayCarrier))

Then see how the airlines rank in arrival delays:

dcCoef

The result is:

UniqueCarrier=AS UniqueCarrier=US UniqueCarrier=DL UniqueCarrier=FL
      -1.9022483       -1.2418484       -0.8013952       -0.2618747
UniqueCarrier=HA UniqueCarrier=YV UniqueCarrier=WN UniqueCarrier=MQ
       0.3143564        1.3189086        2.8824285        2.8843612
UniqueCarrier=OO UniqueCarrier=VX UniqueCarrier=B6 UniqueCarrier=UA
       3.9846305        4.0323447        4.8508215        5.0905867
UniqueCarrier=AA UniqueCarrier=EV UniqueCarrier=F9
       6.3980937        7.2934991        7.8788931

Notice that Alaska Airlines comes in as the best in on-time arrivals, while Frontier is at the bottom of the list.
We can see the difference by subtracting the coefficients:

sprintf("Frontier's additional delay compared with Alaska: %f",
    dcCoef["UniqueCarrier=F9"] - dcCoef["UniqueCarrier=AS"])

which results in:

[1] "Frontier's additional delay compared with Alaska: 9.781141"

4.3 A Large Data Logistic Regression

Our airline data includes a logical variable, ArrDel15, that tells whether a flight is 15 or more minutes late. We can use this as the response for a logistic regression to model the likelihood that a flight will be late given the day of the week:

logitObj <- rxLogit(ArrDel15 ~ DayOfWeek, data = airData)
logitObj

which results in:

Logistic Regression Results for: ArrDel15 ~ DayOfWeek
Data: airData (RxXdfData Data Source)
File name: AirOnTime2012.xdf
Dependent variable(s): ArrDel15
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 6005381
Number of missing observations: 91381

Coefficients:
                   ArrDel15
(Intercept)     -1.61122563
DayOfWeek=Mon    0.04600367
DayOfWeek=Tues  -0.08431814
DayOfWeek=Wed   -0.07319133
DayOfWeek=Thur   0.12585721
DayOfWeek=Fri    0.18987931
DayOfWeek=Sat   -0.13498591
DayOfWeek=Sun       Dropped

5 Extending Your HPC Server Compute Context

In our initial work, we have used a simple compute context that required only the most basic information about our cluster. In the following sections, we discuss a few other options you can use to take full advantage of your HPC Server cluster.

5.1 Using the Head Node for Computation

By default, when compute resources are allocated for your compute context, only the cluster's compute nodes are used for RevoScaleR computations. For most large-scale clusters, this is the desired behavior; the head node is used for job scheduling and administration, and the compute nodes perform all actual computation. However, on smaller clusters, the head node's computing resources may be an essential part of the cluster's computing power. To include the head node as a compute resource, specify computeOnHeadNode = TRUE in your call to RxHpcServer:

myHpcCluster <- RxHpcServer(
    headNode = myHeadNode,
    shareDir = myShareDir,
    revoPath = myRevoPath,
    computeOnHeadNode = TRUE)

5.2 Specifying a Data Path

The dataPath argument allows you to specify one or more directory paths to be searched, in the order specified, for input data. It can be passed either to RxHpcServer or to rxOptions if you want to use the same set of paths for both your local and distributed computations. The dataPath argument is a character vector of well-formed Windows paths, either UNC (\\host\dir) or DOS (C:\dir). Because R requires that backslashes be escaped, you must double the number of backslashes when specifying paths, so that UNC-type paths must be entered in the form \\\\host\\dir, and DOS-type paths must be entered in the form C:\\dir. For example, the following dataPath specifies a particular user's home directory and a system data directory:

dataPath = c("C:\\Users\\YourName", "C:\\data")

We can add the RevoScaleR sample data path as follows:

dataPath = c("C:\\Users\\YourName",
    system.file("SampleData", package = "RevoScaleR"))

5.3 Managing Compute Resources

By default, RxHpcServer creates a compute context that takes advantage of all available nodes, using a moderate priority and not reserving exclusive use. You can, however, use additional arguments to RxHpcServer to request specific computing resources. For example, if your job requires a minimum number of computing elements (nodes or cores) to complete successfully, you can request that minimum number with the argument minElems. Similarly, if you want to restrict the computation to use no more than a specific number of computing elements, you can supply the argument maxElems.
While the scheduler will allocate the requested number of resources, there is no guarantee that all of the requested resources will actually be assigned tasks. So, for example, if you set minElems to 5 and maxElems to 10, the scheduler will allocate somewhere between 5 and 10 compute elements to your job, but it may decide that the job will complete fastest if only three of the allocated elements are actually given work to do.
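As a hedged illustration of the minElems and maxElems arguments (the object myHpcCluster is the compute context constructed in Section 3.1, and the specific bounds here are arbitrary, not a recommendation):

```r
# Illustrative sketch: derive a new compute context from an existing
# one, requesting between 5 and 10 compute elements for the job
myBoundedCluster <- RxHpcServer(myHpcCluster,
    minElems = 5,
    maxElems = 10)
rxSetComputeContext(myBoundedCluster)
```

Deriving the new context from the existing one carries over the head node, share directory, and other settings unchanged, as in Section 3.5.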

If you want to increase the priority of your computation, set priority = 3 (higher priority) or priority = 4 (highest priority). Similarly, to decrease the priority, set priority = 1 (lower priority) or priority = 0 (lowest priority).

You can request specific nodes by supplying a comma-separated character string of node names as the nodes argument. For example, to use compute nodes compute11, compute12, and compute13, specify the nodes argument as follows:

nodes = "compute11,compute12,compute13"

You can also use a character vector of compute node names, as follows:

nodes = c("compute11", "compute12", "compute13")

To request exclusive use of the specified computing resources, specify exclusive = TRUE.

If your cluster has groups defined, you can use the groups argument to specify which groups to use. If both groups and nodes are specified, the union of the two specifications is used.

You can obtain a character vector of all the nodes associated with a given compute context using the function rxGetAvailableNodes:

rxGetAvailableNodes(myHpcCluster)
[1] "CLUSTER-HEAD2" "COMPUTE10"     "COMPUTE11"     "COMPUTE12"
[5] "COMPUTE13"

Notice that in the example above, the head node has a name (cluster-head2) that is not a valid R syntactic name. In certain cases, rxExec returns lists with components named for the nodes from which the results were returned, but needs those names to be valid R syntactic names. You can see the names that rxExec will use in those situations by adding the flag makeRNodeNames = TRUE:

rxGetAvailableNodes(myHpcCluster, makeRNodeNames = TRUE)
[1] "CLUSTER_HEAD2" "COMPUTE10"     "COMPUTE11"     "COMPUTE12"
[5] "COMPUTE13"

You can use rxGetAvailableNodes with an existing compute context to construct the nodes argument for a new compute context; in such a case it will respect the computeOnHeadNode argument of the existing compute context.
5.4 Obtaining Cluster Information

You can obtain even more detailed information about the cluster associated with a given compute context by using the function rxGetNodeInfo. The rxGetNodeInfo function takes a compute context and returns, for each node of the HPC Server cluster associated with that compute context, the following information:

- the node name
- the CPU speed
- the node's available RAM
- the number of cores on the node
- the number of sockets on the node
- the current node state, which may be one of the following:
  o online
  o offline
  o draining (that is, shutting down, but still with jobs in the queue)
  o unknown
- whether the node is reachable, that is, whether there is an operational path between the head node and the given node
- a list of node groups to which the node belongs

The rxGetNodeInfo function may return different information when called on non-HPC Server compute contexts.

5.5 Running a Simple Test of Your Compute Context

Once you've created and registered your compute context, you can use the function rxPingNodes to get a quick view of how ready the nodes in your compute context are to perform jobs. In the simplest case, call rxPingNodes with no arguments to run a simple job on all the nodes in the current compute context:

rxPingNodes()

When run on our initial HPC Server compute context myHpcCluster, this returns the following if everything works:

 status  node name
 -------------------------
 success COMPUTE10
 success COMPUTE12
 success COMPUTE13
 success COMPUTE11

Other possible statuses include the following:

- down: the node could not be reached
- failedJob: the scheduler reports that the job failed on that node
- failedSession: the scheduler launched R on the node, but the session failed
- timeout: the job did not complete on the node before the specified timeout

The timeout argument to rxPingNodes can be used to specify a time limit on the testing; if one or more nodes fail to complete the task by the timeout specified, rxPingNodes returns the timeout status for those nodes:

rxPingNodes(timeout = 30)

The filter argument can be used to restrict the output to just those nodes having the specified return status(es):

rxPingNodes(filter = c("failedJob", "failedSession"))

6 Copying Data with ClusterCopy

ClusterCopy is a tool available as a free download from Microsoft that makes copying data from node to node quick and simple. Once you've downloaded and installed ClusterCopy, using it is easy:

1. From the Start menu, navigate to All Programs\Microsoft HPC Pack 2008 Tool Pack. (In Windows 8-based systems,
2. Launch ClusterCopy.
3. In the Head node text field, enter the name of your cluster's head node, for example, cluster-head.
4. In the Source folder text field, enter the path of the original data file, for example, c:\data.
5. In the Destination folder text field, enter the path for the data on the compute nodes, for example, c:\data.
6. In the Target file(s) text field, enter the name of the file (or files) you want to copy. If you leave this blank, all the files in the source folder are copied.
7. Specify the Number of simultaneous copies from source; this defaults to 3.
8. Click Copy to Nodes (Distribute).

7 Continuing with Distributed Computing

With the linear model and logistic regression performed in the previous sections, you have had a taste of high-performance analytics on the HPC Server platform. You are now ready to continue with the RevoScaleR 7.3 Distributed Computing Guide (RevoScaleR_Distributed_Computing.pdf), which continues the analysis of the 2012 airline on-time data with examples for all of RevoScaleR's HPA functions. You will find this analysis in Chapter 3 of that guide, Running Distributed Analyses.

The Distributed Computing Guide also provides more examples of using non-waiting compute contexts, including managing multiple jobs, and examples of using rxExec to perform traditional high-performance computing, including Monte Carlo simulations and other embarrassingly parallel problems.