Pattern: an open source project for migrating predictive models from SAS, etc., onto Hadoop. Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid



Outline:
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern

Cascading: origins
API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products.
Wensel was following the Nutch open source project, where Hadoop started.
Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce, a potential blocker for leveraging new open source technology.

Cascading: functional programming
Key insight: MapReduce is based on functional programming, going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as Main Street Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows:
it leverages the JVM and Java-based tools without any need to create new languages
it allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters

Cascading: definitions
a pattern language for Enterprise Data Workflows
simple to build, easy to test, robust in production
design principles ensure best practices at scale
[Diagram: example Enterprise architecture. Customers, Web App, Support, Logs, Cache, Modeling (PMML), Analytics Cubes, Reporting, Customer DBs, and customer profile Prefs are connected through source, sink, and trap taps to a Data Workflow running on a Hadoop Cluster.]

Cascading: usage
Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
ASL 2 license, GitHub src, http://conjars.org
5+ yrs production use, multiple Enterprise verticals

Cascading: integrations
partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
serialization: Avro, Thrift, Kryo, JSON, etc.
topologies: Apache Hadoop, tuple spaces, local mode


Cascading: deployments
case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.
use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
the workflow abstraction addresses: staffing bottleneck; system integration; operational complexity; test-driven development


Enterprise Data Workflows
Let's consider a strawman architecture for an example app.
At the front end: LOB use cases drive demand for apps.
In the back office: organizations have substantial investments in people, infrastructure, and process.
The heavy lifting: Main Street firms are migrating workflows to Hadoop, for cost savings and scale-out.

Cascading workflows: taps
taps integrate other data frameworks, as tuple streams
these are plumbing endpoints in the pattern language
sources (inputs), sinks (outputs), traps (exceptions)
text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
data serialization: Avro, Thrift, Kryo, JSON, etc.
extend a new kind of tap in just a few lines of Java
schema and provenance get derived from analysis of the taps

Cascading workflows: taps
Source and sink taps for TSV data in HDFS (body of the app's main method):

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();

Cascading workflows: topologies
topologies execute workflows on clusters
the flow planner is like a compiler for queries, targeting:
Hadoop (MapReduce jobs)
local mode (dev/test or special config)
in-memory data grids (real-time)
the flow planner can be extended to support other topologies
blend flows in different topologies into the same app, for example batch (Hadoop) + transactions (IMDG)

Cascading workflows: topologies
In the same word-count example, the HadoopFlowConnector and its connect() call on the flow definition invoke the flow planner for the Apache Hadoop topology.

Cascading workflows: test-driven development
assert patterns (regex) on the tuple streams
adjust assert levels, like log4j levels
trap edge cases as data exceptions
TDD at scale:
1. start from raw inputs in the flow graph
2. define stream assertions for each stage of transforms
3. verify exceptions, code to remove them
4. when impl is complete, app has full test coverage
redirect traps in production to Ops, QA, Support, Audit, etc.
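The idea of asserting a regex on a stream and diverting failures to a trap can be sketched in plain Java. This is only an in-memory analogy, not the Cascading assertion or trap API; the class `TrapSketch`, its `Result` holder, and the sample record format are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical analogy of a stream assertion with a trap; not Cascading code.
final class TrapSketch {

    // Holds the records that satisfied the assertion and those that were trapped.
    static final class Result {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // Records that fully match the expected pattern continue downstream;
    // everything else is set aside as a data exception instead of failing the run.
    static Result assertMatches(List<String> records, Pattern expected) {
        Result result = new Result();
        for (String record : records) {
            if (expected.matcher(record).matches()) {
                result.passed.add(record);
            } else {
                result.trapped.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed record layout: numeric id, a tab, then a lowercase token.
        Pattern expected = Pattern.compile("\\d+\\t[a-z]+");
        Result result = assertMatches(Arrays.asList("1\tfoo", "x\tbar", "2\t"), expected);
        System.out.println("passed:  " + result.passed);
        System.out.println("trapped: " + result.trapped);
    }
}
```

Keeping the trapped records as ordinary data is what lets them be redirected later to Ops, QA, Support, or Audit.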

Workflow Abstraction: pattern language
Cascading uses a plumbing metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.
In formal terms, this provides a pattern language.
[Diagram: word-count flow. Document Collection, Tokenize, Scrub token, HashJoin Left with a Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases.]
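As a rough analogy only, the word-count flow in the diagram (tokenize, scrub, drop stop words, group by token, count) can be sketched with plain `java.util.stream` operations. This is not the Cascading API; the class name `WordCountSketch`, the sample stop-word list, and the scrubbing rule (trim and lowercase) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory analogy of the word-count flow; not Cascading code.
final class WordCountSketch {

    // Tokenize -> scrub -> drop stop words -> group by token -> count.
    static Map<String, Long> countWords(List<String> docs, Set<String> stopWords) {
        return docs.stream()
            // split lines on spaces, brackets, parentheses, commas, and periods
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // "scrub": trim and lowercase each token (an assumed rule)
            .map(token -> token.trim().toLowerCase())
            .filter(token -> !token.isEmpty())
            // stands in for the join against the stop word list
            .filter(token -> !stopWords.contains(token))
            // group by token and count, sorted by token for stable output
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A rain shadow is a dry area.",
            "Rain falls on the windward side.");
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "on", "the"));
        System.out.println(countWords(docs, stopWords));
    }
}
```

Each stream stage corresponds loosely to one element of the plumbing metaphor: the source list plays the role of a tap, the chained operations play the role of pipes, and the collected map plays the role of a sink.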

Pattern Language
A structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices, i.e., conveying expertise.
Reference: A Pattern Language, Christopher Alexander, et al. amazon.com/dp/0195019199

Workflow Abstraction: literate programming
Cascading workflows generate their own visual documentation: flow diagrams.
In formal terms, flow diagrams leverage a methodology called literate programming.
This provides intuitive, visual representations for apps, which is great for cross-team collaboration.

Literate Programming, by Don Knuth
"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."
Reference: Literate Programming, Univ of Chicago Press, 1992. literateprogramming.com/

Workflow Abstraction: business process
Following the essence of literate programming, Cascading workflows provide statements of business process.
This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data).
Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.).
This is especially apparent in large-scale Cascalog apps: "Specify what you require, not how to achieve it."
By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale.

Business Process, by Edgar Codd
Reference: "A relational model of data for large shared data banks", Communications of the ACM, 1970. dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL, or structured vs. unstructured data frameworks, this approach focuses on what apps do: the process of structuring data.

Cascading: functional programming
Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading, used for their large-scale production deployments.
New case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
Scalding in Scala (2012): github.com/twitter/scalding/wiki
Reference: "Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology", Dan Woods, Forbes, 2013-04-17. forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

Functional Programming for Big Data
WordCount with token scrubbing:
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to Scalding: 18 lines Scala/Cascading
Functional programming languages help reduce software engineering costs at scale, over time.

Two Avenues to the App Layer
Enterprise: must contend with complexity at scale every day. Incumbents extend current practices and infrastructure investments (using J2EE, ANSI SQL, SAS, etc.) to migrate workflows onto Apache Hadoop while leveraging existing staff.
Start-ups: crave complexity and scale to become viable. New ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
[Chart axes: complexity, scale]


PMML: standard
established XML standard for predictive model markup
organized by Data Mining Group (DMG), since 1997: http://dmg.org/
members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy, Microsoft, etc.
PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows
"PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application." wikipedia.org/wiki/predictive_model_markup_language

PMML: model coverage
Association Rules: AssociationModel element
Cluster Models: ClusteringModel element
Decision Trees: TreeModel element
Naïve Bayes Classifiers: NaiveBayesModel element
Neural Networks: NeuralNetwork element
Regression: RegressionModel and GeneralRegressionModel elements
Rulesets: RuleSetModel element
Sequences: SequenceModel element
Support Vector Machines: SupportVectorMachineModel element
Text Models: TextModel element
Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-pmml2/

PMML vendor coverage


Pattern: model scoring
migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
great open source tools: R, Weka, KNIME, Matlab, RapidMiner, etc.
integrate with other libraries: Matrix API, etc.
leverage PMML as another kind of DSL
cascading.org/pattern

Pattern: create a model in R

    ## packages providing randomForest(), pmml(), and saveXML()
    library(randomForest)
    library(pmml)
    library(XML)

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)

    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

Pattern: capture model parameters as PMML

    <?xml version="1.0"?>
    <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
      <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
        <Application name="Rattle/PMML" version="1.2.30"/>
        <Timestamp>2012-10-22 19:39:28</Timestamp>
      </Header>
      <DataDictionary numberOfFields="4">
        <DataField name="label" optype="categorical" dataType="string">
          <Value value="0"/>
          <Value value="1"/>
        </DataField>
        <DataField name="var0" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>
      </DataDictionary>
      <MiningModel modelName="randomForest_Model" functionName="classification">
        <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
          <Segment id="1">
            <True/>
            <TreeModel modelName="randomForest_Model" functionName="classification"
                       algorithmName="randomForest" splitCharacteristic="binarySplit">
              <MiningSchema>
                <MiningField name="label" usageType="predicted"/>
                <MiningField name="var0" usageType="active"/>
                <MiningField name="var1" usageType="active"/>
                <MiningField name="var2" usageType="active"/>
              </MiningSchema>
    ...
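To show how the metadata in a PMML file can be read as plain data, here is a minimal sketch that lists the DataField entries of a DataDictionary using only the JDK's DOM parser. It is not how Pattern parses PMML internally; the class `PmmlDictionarySketch` and the shortened sample document are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads field names and optypes from a PMML DataDictionary.
final class PmmlDictionarySketch {

    // Returns field name -> optype, in document order.
    static Map<String, String> dataFields(String pmmlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("DataField");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                result.put(field.getAttribute("name"), field.getAttribute("optype"));
            }
            return result;
        } catch (Exception e) {
            // wrap parser and I/O errors so callers need no checked exceptions
            throw new IllegalStateException("could not parse PMML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened sample document (an assumption), using the field names from the slide.
        String xml = "<PMML version='4.0'><DataDictionary numberOfFields='2'>"
            + "<DataField name='label' optype='categorical' dataType='string'/>"
            + "<DataField name='var0' optype='continuous' dataType='double'/>"
            + "</DataDictionary></PMML>";
        System.out.println(dataFields(xml));
    }
}
```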

Pattern: score a model, within an app

    public static void main( String[] args ) throws RuntimeException {
      String inputPath = args[ 0 ];
      String classifyPath = args[ 1 ];

      // set up the config properties
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
      Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

      // handle command line options
      OptionParser optParser = new OptionParser();
      optParser.accepts( "pmml" ).withRequiredArg();
      OptionSet options = optParser.parse( args );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "classify" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      if( options.hasArgument( "pmml" ) ) {
        String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
        PMMLPlanner pmmlPlanner = new PMMLPlanner()
          .setPMMLInput( new File( pmmlPath ) )
          .retainOnlyActiveIncomingFields()
          // default value if missing from the model
          .setDefaultPredictedField( new Fields( "predict", Double.class ) );
        flowDef.addAssemblyPlanner( pmmlPlanner );
      }

      // write a DOT file and run the flow
      Flow classifyFlow = flowConnector.connect( flowDef );
      classifyFlow.writeDOT( "dot/classify.dot" );
      classifyFlow.complete();
    }

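Pattern: score a model, using pre-defined Cascading app
[Diagram: Customer Orders and a PMML Model feed a Classify step that produces Scored Orders; an Assert step, GroupBy token, and Count produce a Confusion Matrix; exceptions go to Failure Traps; M and R mark the map and reduce phases. cascading.org/pattern]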

Pattern: score a model, using pre-defined Cascading app

    ## run an RF classifier at scale
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml

    ## run an RF classifier at scale, assert regression test, measure confusion matrix
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml --assert --measure out/measure

    ## run a predictive model at scale, measure RMSE
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
      --pmml data/iris.lm_p.xml --rmse out/measure
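For reference, the RMSE reported by the last command is the usual root-mean-square error between actual and predicted values. A minimal sketch of that formula follows; it does not reproduce how the Pattern app computes or writes the measure, and the class `RmseSketch` and the sample values are assumptions.

```java
// Hypothetical helper: root-mean-square error between two equally sized arrays.
final class RmseSketch {

    static double rmse(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquaredError = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = actual[i] - predicted[i];
            sumSquaredError += error * error;
        }
        // mean of the squared errors, then the square root
        return Math.sqrt(sumSquaredError / actual.length);
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        double[] actual = {5.1, 4.9, 6.3};
        double[] predicted = {5.0, 5.2, 6.0};
        System.out.println(rmse(actual, predicted));
    }
}
```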

Roadmap: existing algorithms for scoring
Random Forest
Decision Trees
Linear Regression
GLM
Logistic Regression
K-Means Clustering
Hierarchical Clustering
Multinomial
Support Vector Machines (prepared for release)
also, model chaining and general support for ensembles
cascading.org/pattern

Roadmap: next priorities for scoring
Time Series (ARIMA forecast)
Association Rules (basket analysis)
Naïve Bayes
Neural Networks
algorithms extended based on customer use cases
contact: groups.google.com/forum/?fromgroups#!forum/pattern-user
cascading.org/pattern

Roadmap: top priorities for creating models at scale
Random Forest
Logistic Regression
K-Means Clustering
Association Rules
plus all models which can be trained via sparse matrix factorization (TSQR => PCA, SVD, least squares, etc.)
a wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop
cascading.org/pattern


Experiments: comparing models
much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
run multiple variants, then measure relative lift
Concurrent runtime: tag and track models
The following example compares two models trained with different machine learning algorithms. It is exaggerated: one model has an important variable intentionally omitted, to help illustrate the experiment.

Experiments: Random Forest model

    ## train a Random Forest model
    ## example: http://mkseo.pe.kr/stats/?p=220
    f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
    fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
    print(fit)
    saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

Output:

    OOB estimate of error rate: 14%
    Confusion matrix:
        0   1 class.error
    0  69  16   0.1882353
    1  12 103   0.1043478

Experiments: Logistic Regression model

    ## train a Logistic Regression model (special case of GLM)
    ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

Output:

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
    var0         -1.3755     0.4355  -3.159  0.00159 **
    var2         -3.7742     0.5794  -6.514 7.30e-11 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

NB: this model has var1 intentionally omitted
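To make the fitted coefficients concrete, the sketch below scores a single record with them. It assumes the default logit link of R's binomial family and that the modeled probability is for the second factor level of label ("1"); the class `LogisticScoreSketch` and the sample inputs are hypothetical, and this is not the code path Pattern uses when it scores the exported PMML.

```java
// Hypothetical scorer built from the coefficient table above.
final class LogisticScoreSketch {

    // Estimates copied from the glm summary output.
    static final double INTERCEPT = 1.8524;
    static final double COEF_VAR0 = -1.3755;
    static final double COEF_VAR2 = -3.7742;

    // Probability of label "1" under an assumed logit link:
    // p = 1 / (1 + exp(-(b0 + b1*var0 + b2*var2)))
    static double probability(double var0, double var2) {
        double linearPredictor = INTERCEPT + COEF_VAR0 * var0 + COEF_VAR2 * var2;
        return 1.0 / (1.0 + Math.exp(-linearPredictor));
    }

    public static void main(String[] args) {
        // Made-up input values for illustration only.
        System.out.println(probability(0.0, 0.0));
        System.out.println(probability(1.0, 1.0));
    }
}
```

Both coefficients are negative, so larger values of var0 or var2 lower the predicted probability of label "1".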

Experiments: comparing results
use a confusion matrix to compare results for the classifiers
Logistic Regression has a lower false negative rate (5% vs. 11%)
however, it has a much higher false positive rate (52% vs. 14%)
assign a cost model to select a winner
for example, in an ecommerce anti-fraud classifier:
FN: chargeback risk
FP: customer support costs
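A cost model of this kind can be sketched as follows. The counts are taken from the Random Forest OOB confusion matrix printed earlier (rows read as true labels, with "1" treated as the positive class, which is an assumption); they are not the scored results behind the percentages quoted above, so the rates differ. The per-error costs and the class `CostModelSketch` are hypothetical.

```java
// Hypothetical cost model for comparing classifiers from confusion-matrix counts.
final class CostModelSketch {

    // Share of true negatives that were flagged as positive.
    static double falsePositiveRate(long falsePositives, long trueNegatives) {
        return (double) falsePositives / (falsePositives + trueNegatives);
    }

    // Share of true positives that were missed.
    static double falseNegativeRate(long falseNegatives, long truePositives) {
        return (double) falseNegatives / (falseNegatives + truePositives);
    }

    // Total cost: each kind of error is weighted by what it costs the business.
    static double totalCost(long falsePositives, long falseNegatives,
                            double costPerFalsePositive, double costPerFalseNegative) {
        return falsePositives * costPerFalsePositive + falseNegatives * costPerFalseNegative;
    }

    public static void main(String[] args) {
        // OOB counts from the Random Forest output: TN=69, FP=16, FN=12, TP=103.
        long tn = 69, fp = 16, fn = 12, tp = 103;
        System.out.println("FP rate: " + falsePositiveRate(fp, tn));
        System.out.println("FN rate: " + falseNegativeRate(fn, tp));
        // Assumed costs: 5 per support contact (FP), 50 per chargeback (FN).
        System.out.println("cost:    " + totalCost(fp, fn, 5.0, 50.0));
    }
}
```

Running the same calculation for each candidate model and picking the lowest total cost is one simple way to select a winner; with these counts the two rates match the class.error column of the randomForest output.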


Two Cultures
"A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets."
Reference: "Statistical Modeling: The Two Cultures", Leo Breiman, 2001. bit.ly/euth9l
In other words, seeing the forest for the trees: this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization).

Why Do Ensembles Matter?
[Figure: two panels, "The World" and "The World per Data Modeling"]

Algorithmic Modeling
"The trick to being a scientist is to be open to using a wide variety of tools." (Breiman)
circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team

Ensemble Models
Breiman: "a multiplicity of data models"
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially
References:
"Ensemble Learning: Better Predictions Through Diversity", Todd Holloway, ETech (2008). abeautifulwww.com/ensemblelearningetech.pdf
"The Story of the Netflix Prize: An Ensemblers Tale", Lester Mackey, National Academies Seminar, Washington, DC (2011). stanford.edu/~lmackey/papers/
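One of the simplest ways to combine classifiers is a majority vote, the same combination rule named in the Segmentation element of the Random Forest PMML shown earlier. The sketch below is illustrative only; the class `MajorityVoteSketch` and its tie-breaking rule (the first label to reach the top count wins) are assumptions, not Pattern's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical majority-vote combiner for an ensemble of classifiers.
final class MajorityVoteSketch {

    // Returns the label predicted by the most models, or null for no votes.
    // On a tie, the label that reached the winning count first is kept.
    static String majorityVote(List<String> votes) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String vote : votes) {
            int count = counts.merge(vote, 1, Integer::sum);
            if (count > bestCount) {
                best = vote;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up predictions from three trees for one record.
        System.out.println(majorityVote(Arrays.asList("1", "0", "1")));
    }
}
```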

KDD 2013 PMML Workshop
"Pattern: PMML for Cascading and Hadoop", Paco Nathan, Girish Kathalagiri. Chicago, 2013-08-11 (accepted).
19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. kdd13pmml.wordpress.com


Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies.
[Diagram: data sources, ETL, data prep, predictive model, end uses]
ANSI SQL for ETL
J2EE for business logic
SAS for predictive models
ANSI SQL for ETL and SAS for predictive models: most of the licensing costs
J2EE for business logic: most of the project costs

Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app: one among many, typically based on 100% open source
Lingual: DW → ANSI SQL
business logic in Java, Clojure, Scala, etc.
Pattern: SAS, R, etc. → PMML
sources for Cassandra, JDBC, Splunk, etc.
sinks for Memcached, HBase, MongoDB, etc.
a compiler sees it all
cascading.org
[diagram stages: data sources, ETL, data prep, predictive model, end uses]

Anatomy of an Enterprise app
Lingual: DW → ANSI SQL (the ETL stage)
    // define the ETL flow: two sources, one sink
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "etl" )
      .addSource( "example.employee", emplTap )
      .addSource( "example.sales", salesTap )
      .addSink( "results", resultsTap );
    // let Lingual plan the flow from an ANSI SQL statement
    SQLPlanner sqlPlanner = new SQLPlanner()
      .setSql( sqlStatement );
    flowDef.addAssemblyPlanner( sqlPlanner );
a compiler sees it all

Anatomy of an Enterprise app
Pattern: SAS, R, etc. → PMML (the predictive model stage)
    // define the classifier flow: one source, one sink
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "classifier" )
      .addSource( "input", inputTap )
      .addSink( "classify", classifyTap );
    // let Pattern plan the flow from a PMML model file
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlModel ) )
      .retainOnlyActiveIncomingFields();
    flowDef.addAssemblyPlanner( pmmlPlanner );
a compiler sees it all

Anatomy of an Enterprise app
visual collaboration for the business logic is a great way to improve how teams work together
[workflow diagram: employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

Anatomy of an Enterprise app
multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster
business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
[workflow diagram: employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]
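The "Failure Traps" label in the diagram refers to diverting records that cause an exception instead of failing the whole flow. A minimal plain-Java sketch of that idea follows; it does not use the Cascading API, and the class `TrappedStep` and its fields are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of a failure trap; not the Cascading implementation.
final class TrappedStep<I, O> {
    final List<O> results = new ArrayList<>();  // records processed successfully
    final List<I> trapped = new ArrayList<>();  // records whose processing threw

    TrappedStep(List<I> input, Function<I, O> operation) {
        for (I record : input) {
            try {
                results.add(operation.apply(record));
            } catch (RuntimeException e) {
                // Divert the bad record and keep going with the rest.
                trapped.add(record);
            }
        }
    }

    public static void main(String[] args) {
        List<String> leads = List.of("12", "n/a", "7");
        TrappedStep<String, Integer> step =
            new TrappedStep<String, Integer>(leads, Integer::parseInt);

        System.out.println("results: " + step.results);
        System.out.println("trapped: " + step.trapped);
    }
}
```

Keeping the trapped records in their original form means they can be inspected or reprocessed later, which fits the troubleshooting and exception-handling concerns listed on the slide.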

references
Enterprise Data Workflows with Cascading, O'Reilly, 2013: amazon.com/dp/1449358721
newsletter updates: liber118.com/pxn/

acknowledgements
Many thanks to others who have contributed code, ideas, suggestions, etc., to Pattern:
Chris Wensel @ Concurrent
Girish Kathalagiri @ AgilOne
Vijay Srinivas Agneeswaran @ Impetus
Chris Severs @ eBay
Ofer Mendelevitch @ Hortonworks
Sergey Boldyrev @ Nokia
Quinton Anderson @ IZAZI Solutions
Chris Gutierrez @ Airbnb
Villu Ruusmann @ JPMML project

drill-down blog, developer community, code/wiki/gists, maven repo, commercial products, etc.: cascading.org zest.to/group11 github.com/cascading conjars.org goo.gl/kqtul concurrentinc.com