Cascading Pattern - How to quickly migrate Predictive Models (PMML) from SAS, R, MicroStrategy, etc., onto Hadoop and deploy them at scale


V1.0, September 12, 2013

Introduction

Summary

Cascading Pattern is a machine learning project within the Cascading development framework, which is used to build enterprise data workflows. The Cascading framework provides an abstraction layer on top of Hadoop and other computing topologies. It allows enterprises to leverage existing skills and resources to build data processing applications on Apache Hadoop, without specialized Hadoop skills. Pattern, in particular, leverages an industry standard called the Predictive Model Markup Language (PMML), which allows data scientists to use their favorite statistical and analytics tools, such as R, Oracle, and others, to export predictive models and quickly run them on data sets stored in Hadoop. Pattern's benefits include reduced development costs, time savings, and fewer licensing issues at scale, all while leveraging Hadoop clusters, the core competencies of analytics staff, and the existing intellectual property in the predictive models.

Enterprise Use Cases

In the context of an enterprise application, Cascading Pattern tends to fit in as shown in the diagram below. Predictive modeling is typically performed by analytics teams in the back office, using SAS, R, SPSS, MicroStrategy, Teradata, and similar tools. With Cascading Pattern, predictive models can now be exported as PMML from a variety of analytics frameworks and then run on Apache Hadoop at scale. This saves licensing costs, allows applications to scale out, and lets predictive models be integrated directly with other business logic expressed as Cascading apps.

Target Audience

Data scientists / analysts with intermediate experience in statistical modeling tools such as SAS, R, or MicroStrategy, and novice-level Java experience.

Prerequisites

System requirements:
- Mac OS, Windows 7, or Linux, with a minimum of 4 GB RAM
- JDK 1.6 (the tutorial is not currently supported on JDK 1.8)
- Gradle 1.11 (the tutorial is not currently supported on Gradle 2.0)
- Hortonworks Data Platform Sandbox
- R and RStudio

Overview

During the course of this tutorial you'll perform the following steps:
- Environment setup
- Download source code
- Model creation
- Cascading build
- Enterprise app
- Additional resources

Step 1: Environment Setup

In this section, we will go through the steps needed to set up your environment.

1. To set up Java for your environment, download Java (http://www.java.com/getjava/) and follow the installation instructions. Notes:
   - Version 1.6.x was used to create the examples used here.
   - Get the JDK, not the JRE.
   - Install according to the vendor instructions.
   - Be sure to set the JAVA_HOME environment variable correctly.

2. To set up Gradle for your environment, download Gradle (http://www.gradle.org/downloads) and follow the installation instructions. Notes:
   - Version 1.4 or later is required for some examples in this tutorial.
   - Be sure that you do not select version 2.x.
   - Install according to the vendor instructions.
   - Be sure to set the GRADLE_HOME environment variable correctly.
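Steps 1 and 2 above each end by asking you to set an environment variable. As a minimal sketch for Linux or Mac OS shells (the install paths below are placeholders -- substitute the locations where you actually installed the JDK and Gradle):

```shell
# Placeholder install locations -- adjust to your own JDK and Gradle installs.
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45
export GRADLE_HOME=/opt/gradle-1.11

# Put both toolchains on the PATH so the `java` and `gradle` commands resolve.
export PATH="$JAVA_HOME/bin:$GRADLE_HOME/bin:$PATH"
```

Add lines like these to ~/.bash_profile (Mac OS) or ~/.bashrc (Linux) and open a new terminal; `gradle -v` should then report the expected Gradle version.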

3. Set up the Hortonworks Data Platform Sandbox, per vendor instructions.

4. Set up R and RStudio for your environment by visiting:
   http://cran.r-project.org/
   http://www.rstudio.com/ide/download/

Step 2: Source Code

In this section, we will:

1. Navigate to GitHub (https://github.com/cascading/tutorials) to download the source code we will be using.

2. In the bottom right corner of the screen, click "Download ZIP" to download a compressed archive of the source code for the Cascading Pattern project.

3. Once the download has completed, unzip the archive and move the "pattern" directory to a location on your file system where you have space available to work.

Step 3: Model Creation

1. Change into the pattern directory, and then into its pattern-examples subdirectory. There is an example R script in examples/r/rf_pmml.r that creates a Random Forest model. It is representative of a predictive model for an anti-fraud classifier used in e-commerce apps.

    ## (excerpt -- data_train, data, and dat_folder are defined earlier in the script)
    library(randomForest)
    library(pmml)

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)

    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                quote=FALSE, sep="\t", row.names=FALSE)

    ## export the RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

2. Load the rf_pmml.r script into RStudio using the File menu and the "Open File..." option.

3. Click the "Source" button in the upper middle section of the screen.

This will execute the R script and create the predictive model. The last line saves the predictive model as PMML into a file called sample.rf.xml. If you take a look at that PMML, it's XML: not optimal for humans to read, but efficient for machines to parse:

    <?xml version="1.0"?>
    <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
     <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
      <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.2.30"/>
      <Timestamp>2012-10-22 19:39:28</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
       <Value value="0"/>
       <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="label" usageType="predicted"/>
       <MiningField name="var0" usageType="active"/>
       <MiningField name="var1" usageType="active"/>
       <MiningField name="var2" usageType="active"/>
      </MiningSchema>
      <Segmentation multipleModelMethod="majorityVote">
       <Segment id="1">
        <True/>
        <TreeModel modelName="randomForest_Model" functionName="classification"
            algorithmName="randomForest" splitCharacteristic="binarySplit">
         <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
         </MiningSchema>
         ...

Cascading Pattern supports additional models, as well as ensembles, of the following model types:
- General Regression
- Regression
- Clustering
- Tree
- Mining

Step 4: Cascading Build

1. Now that we have a model created and exported as PMML, let's work on running it at scale atop Apache Hadoop. In the pattern-examples directory, execute the following Bash shell command:

    gradle clean jar

2. That command invokes Gradle to run the build script build.gradle and compile the Cascading Pattern example app.

3. After it compiles, look for the built app as a JAR file in the build/libs subdirectory:

    ls -lts build/libs/pattern-examples-*.jar

4. Now we're ready to run this Cascading Pattern example app on Apache Hadoop. First, we make sure to delete the output results (required by Hadoop). Then we run Hadoop, specifying the JAR file for the app, the PMML file via the --pmml command line option, the sample input data data/sample.tsv, and the location of the output results:

    rm -rf out
    hadoop jar build/libs/pattern-examples-*.jar \
      data/sample.tsv out/classify --pmml data/sample.rf.xml

5. After that runs, check the out/classify subdirectory. Look at the results of running the PMML model, which will be in the part-* partition files:

    more out/classify/part-*

6. Let's take a look at what we just built and ran. The source code for this example is located in the src/main/java/cascading/pattern/Main.java file:

    import java.io.File;
    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pattern.pmml.PMMLPlanner;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;
    import joptsimple.OptionParser;
    import joptsimple.OptionSet;

    public class Main {
      /** @param args */
      public static void main( String[] args ) throws RuntimeException {
        String inputPath = args[ 0 ];
        String classifyPath = args[ 1 ];

        // set up the config properties
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

        // handle command line options
        OptionParser optParser = new OptionParser();
        optParser.accepts( "pmml" ).withRequiredArg();
        OptionSet options = optParser.parse( args );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef()
          .setName( "classify" )
          .addSource( "input", inputTap )
          .addSink( "classify", classifyTap );

        // build a Cascading assembly from the PMML description
        if( options.hasArgument( "pmml" ) ) {
          String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );

          PMMLPlanner pmmlPlanner = new PMMLPlanner()
            .setPMMLInput( new File( pmmlPath ) )
            .retainOnlyActiveIncomingFields()
            .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model

          flowDef.addAssemblyPlanner( pmmlPlanner );
        }

        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDot( "dot/classify.dot" );
        classifyFlow.complete();
      }
    }

Most of the code is the basic plumbing used in Cascading apps. The portions specific to Cascading Pattern and PMML are the few lines involving the pmmlPlanner object.

Additional Resources

- Community: http://www.cascading.org/
- Cascading Pattern home: http://www.cascading.org/pattern/
- Cascading Lingual: http://www.cascading.org/lingual/
- O'Reilly book on Cascading: http://oreil.ly/143jst6

About Concurrent

Concurrent, Inc.'s vision is to become the #1 software platform choice for Big Data applications. Concurrent builds application infrastructure products designed to help enterprises create, deploy, run, and manage data processing applications at scale on Apache Hadoop. Concurrent is the mind behind Cascading, the most widely used and deployed technology for Big Data applications, with more than 90,000 user downloads a month. Used by thousands of data-driven businesses, including Twitter, eBay, The Climate Corporation, and Etsy, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco. Visit Concurrent online at http://www.concurrentinc.com.

About Hortonworks

Hortonworks is the only 100-percent open source software provider to develop, distribute and support an Apache Hadoop platform explicitly architected, built and tested for enterprise-grade deployments.
Developed by the original architects, builders and operators of Hadoop, Hortonworks stewards the core and delivers the critical services required by the enterprise to reliably and effectively run Hadoop at scale. Our distribution, Hortonworks Data Platform, provides an open and stable foundation for enterprises and a growing ecosystem to build and deploy big data solutions. Hortonworks also provides unmatched technical support, training and certification programs. For more information, visit http://www.hortonworks.com. Get started and go from zero to Hadoop in 15 minutes with the Hortonworks Sandbox.