R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution Analytics @bill_jacobs

Polling Question #1: Who Are You? (choose one)
- Statistician or modeler who uses R
- Other R developer
- Hadoop Expert
- Application builder
- Data guru
- Business user
- Systems vendor or reseller
- Something else

Agenda
- Challenges
- Options
- Considerations
- How to Choose

Boundless Opportunities
- Marketing: Clickstream & Campaign Analyses
- Digital Media: Recommendation Engines
- Retail: Social Sentiment Analysis
- Insurance: Fraud Waste and Abuse
- Healthcare Delivery: Outcome Prediction
- Manufacturing: Quality Optimization
- P&C Insurance: Risk Analysis
- Consumer Products: Warranty Optimization
- Operations: Supply Chain Optimization
- Econometrics: Market Prediction
- Marketing: Mix and Price Optimization
- Life Sciences: Pharmacogenetics
- Transportation: Asset Utilization

Polling Question #2: What Industry Do You Represent?
- Financial Services
- Insurance
- Healthcare, Life Sciences or Pharma
- Manufacturing
- Energy
- Retail
- Logistics and Transportation
- Education
- Government
- Marketing & Advertising
- Technology
- Other

In A Perfect World Analytical Capability Security Compute Ease Data Scale Price Users

Hadoop Analytics - Many Alternatives
- R-Based Alternatives
- Legacy tools updated: SAS HPA, etc.
- Big Data Databases
- Other Languages: Scala, Java, Julia, various GUIs
Today's Topic: R-Based Alternatives
- Beside Architectures
- Inside Architectures
- Open Source and Commercial

Reality: Tradeoffs.
- Traditional Statistics vs. Machine Learning
- In-Memory vs. Shared Infrastructure
- CRAN vs. Parallelization
- Desktop vs. Remote
- Explicit vs. Automatic Distribution
- Real-Time vs. MapReduce
- Locality vs. Movement
- Memory Limits

No Magic Bullet.

Corporate Overview & Quick Facts
Revolution R Enterprise is the leading commercial analytics platform based on the open source R statistical computing language.
- Founded: 2008 (as REvolution Computing)
- Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London
- CEO: David Rich
- Number of customers: 200+
- Investors: Northbridge Venture Partners, Intel Capital, Platform Vendor
- Web site: www.revolutionanalytics.com

Revolution Analytics
Our Vision: R becomes the de facto standard for enterprise predictive analytics.
Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges.

Revolution Analytics Builds & Delivers: Software Products: Support & Services Stable Distributions Commercial Support Programs Broad Platform Support Training Programs Professional Services Big Data Analytics in R Application Integration Community Programs Deployment Platforms Academic Support Programs Agile Development Tooling Contributions to Open Source R Future Platform Support Open Source Extensions Sponsorship of R User Groups

Revolution Analytics Technical Innovations R Options from Open Source Production Deployment to Enterprise Support Parallelized Analytical Computation In-Database & In-Hadoop Analytics Big Data Scalability Multi-Platform Deployment Legacy Data Format Support Multiple IDE Options PMML Model Export Remote Execution

The Revolution R Product Suite
- Revolution R Open: Free and open source R distribution. Enhanced and distributed by Revolution Analytics.
- Revolution R Plus: Open-source distribution of R, packages, and other components. Enhanced, supported and indemnified by Revolution Analytics.
- Revolution R Enterprise: Secure, Scalable and Supported Distribution of R. With proprietary components created by Revolution Analytics.

Polling Question #3: State of Play: In your company you are
- Building Our Data Lake
- Running R + Hadoop Data Today
- Running R inside Hadoop using Open Source
- Running RRE inside Hadoop
- Deploying Business Apps Using Analytics from Hadoop Data
- Looking at Next Steps, e.g. Spark, etc.

Revolution Analytics: Eight Alternatives for Integrating R & Hadoop
Open Source
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization on Workstations & Servers
4. RHadoop: Open Source Parallelization with RHadoop
Commercial
5. Revolution R Enterprise on Servers & Workstations
6. Revolution R Enterprise on Edge Nodes
7. Revolution R Enterprise Inside Hadoop
8. Combined Edge Node & Inside Hadoop

1. Open Source R Integrated With Hadoop
Traditional Open Source R Beside Architecture
- Components: CRAN Algorithms; RODBC, rhdfs, rhbase, RHive
- Traditional Open Source
- Memory-Limited
- Data Moves

2. Revolution R Open On Workstations & Servers
Replace Open Source R Beside Architecture with Revolution R Open
- Components: CRAN Algorithms; RODBC, rhdfs, rhbase, RHive
- As with Open Source R: Still Free. Still Memory Based. Data Still Moves.
- Improvements: Accelerates Math with Intel MKL; Improves R-based packages
- Limitations: No Effect for non-R Code

Accelerate R Math with Intel Math Kernel Library (MKL). Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html

3. Write Parallel Algorithms: PC, Server or Clusters
Write R Code to Explicitly Parallelize; Deploy Across Several Systems
- Tools: foreach & iterators; doParallel (PC, server); doMPI (cluster); RRE rxExec
- Example Uses: Bootstrapping, Simulation, HPC
- Can Include CRAN Algorithms Carefully
- Components: RODBC, rhdfs, rhbase, RHive
- As with Previous: Still Free. Still Memory Based. Data Still Moves. Intel MKL with RRO.
- Improvements: Parallelized Execution
- Limitations: Parallelization Difficulty; Data Movement; Platform Specific
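The explicit parallelization that option 3 describes, using bootstrapping as the example use, can be sketched as follows. This is a minimal illustration written in Python rather than R; it does not use foreach, doParallel, doMPI or rxExec, and only mirrors the pattern of fanning independent replicates out to a worker pool. The function names, the replicate count and the default thread pool are assumptions made for the sketch.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(data, seed):
    """Draw one resample with replacement and return its mean."""
    rng = random.Random(seed)  # one seeded generator per replicate
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def parallel_bootstrap(data, n_replicates=200, workers=4,
                       executor_cls=ThreadPoolExecutor):
    """Run independent bootstrap replicates on a worker pool.

    Hypothetical helper: the caller decides how the work is split,
    which is what "explicit" parallelization means here.
    """
    seeds = range(n_replicates)
    with executor_cls(max_workers=workers) as pool:
        # map() keeps results in seed order, so runs are reproducible
        return list(pool.map(bootstrap_mean, [data] * n_replicates, seeds))
```

For CPU-bound work in CPython, passing `executor_cls=ProcessPoolExecutor` (from the same module) would use separate processes instead of threads. Either way the full data set is handed to every worker, which matches the slide's point that the data still moves and must still fit in memory.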

4. RHadoop: Custom Parallel Execution for Hadoop
Execute R Code & CRAN Algorithms Inside Hadoop
- Components: Remote Desktop; R Code; rmapreduce; Hadoop Streaming; rhbase; rhdfs
- Example Uses: Scoring, Transformation, Easily Parallelized Algorithms
- Can Include CRAN Algorithms
- As With Previous: Still Free. Optional Intel MKL in RRO.
- Improvements: Runs R in MapReduce; No Data Movement
- Limitations: Manual Parallelization; Hadoop Specific
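The manual parallelization that option 4 requires means writing the map and reduce steps yourself. The sketch below shows that shape in plain Python, run in a single process; it does not call Hadoop Streaming or any RHadoop package. The record format ("store,amount") and all function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map step: parse one 'store,amount' record and emit (key, value)."""
    store, amount = line.strip().split(",")
    yield store, float(amount)


def reducer(key, values):
    """Reduce step: average all values that share a key."""
    values = list(values)
    return key, sum(values) / len(values)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in one process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle and sort
    return dict(
        reducer(key, (value for _, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    )
```

Scoring and per-record transformation fit this pattern easily because each record is handled independently in the map step; algorithms that need several passes over the data take more manual work, which is the limitation the slide lists.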

5. Revolution R Enterprise (RRE) PEMAs inside Hadoop
Traditional Beside Architecture with Optimized Algorithms
- Available for Windows, Linux
- As With Previous: Includes Intel MKL in RRO
- Components: ScaleR PEMA Algorithms plus All of CRAN (subject to memory limits); RODBC, rhdfs, rhbase, RHive
- Advantages:
  - Speed: PEMAs Parallelize Across Threads, Cores & Sockets
  - Scale: PEMAs Chunk, no Memory Limits
  - All of CRAN Available
  - Portability
  - Fully Supported
- Limitations: Data Movement; Single Machine

Revolution R Enterprise is... the only big data big analytics platform based on open source R
- High Performance, Scalable Analytics
- Portable Across Enterprise Platforms
- Easier to Build & Deploy Analytics

ScaleR Refactor Algorithms for Dramatic Performance and Capacity Improvement

ScaleR: High Performance Algorithms for the Most Common Uses
Data Step
- Data import: Delimited, Fixed, SAS, SPSS, ODBC
- Variable creation & transformation
- Recode variables
- Factor variables
- Missing value handling
- Sort, Merge, Split
- Aggregate by category (means, sums)
Descriptive Statistics
- Min / Max, Mean, Median (approx.)
- Quantiles (approx.)
- Standard Deviation
- Variance
- Correlation
- Covariance
- Sum of Squares (cross product matrix for set variables)
- Pairwise Cross tabs
- Risk Ratio & Odds Ratio
- Cross-Tabulation of Data (standard tables & long form)
- Marginal Summaries of Cross Tabulations
Statistical Tests
- Chi Square Test
- Kendall Rank Correlation
- Fisher's Exact Test
- Student's t-test
Sampling
- Subsample (observations & variables)
- Random Sampling
Predictive Models
- Sum of Squares (cross product matrix for set variables)
- Multiple Linear Regression
- Generalized Linear Models (GLM): exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
- Covariance & Correlation Matrices
- Logistic Regression
- Classification & Regression Trees
- Predictions/scoring for models
- Residuals for all models
Variable Selection
- Stepwise Regression
Simulation
- Simulation (e.g. Monte Carlo)
- Parallel Random Number Generation
Cluster Analysis
- K-Means
Classification (New in 7.3)
- Decision Trees
- Decision Forests
- Gradient Boosted Decision Trees
Combination
- PEMA-R API
- rxDataStep
- rxExec

What's a PEMA? Parallel External Memory Algorithms
- Script Calls ScaleR Algorithm; Scripts can call CRAN Open Source Algorithms
- Master Algorithm Process: Start & Manage Processing; Combine Individual Results
- ScaleR PEMA: Load a Block At A Time; Analyze Each Block
- Unlimited Data Scale: Data Not Limited to Available Memory; Ingests Data One Chunk At A Time; Adjustable Memory Footprint
- Performance: Multi-Thread Execution; Highly-Optimized Algorithms; Algorithm Math Fully Refactored for Parallelism
- Delivered as ScaleR Library in Revolution R Enterprise
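The block-at-a-time idea on this slide can be sketched with a simple linear regression whose fit depends only on a few running sums. The code below is an illustration in Python of the general external-memory pattern (analyze each block, then let a master step combine the partial results); it is not ScaleR's implementation, and every function name and default in it is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def read_chunks(rows, chunk_size):
    """Yield the data one block at a time, as if streaming from disk."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def chunk_stats(chunk):
    """Per-block work: sums needed to fit y = intercept + slope * x."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, sxy


def combine(a, b):
    """Master step: merge two partial results by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(rows, chunk_size=1000, workers=4):
    """Fit the line from per-block sums; returns (intercept, slope)."""
    total = (0, 0.0, 0.0, 0.0, 0.0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: Executor.map submits every block up front. A real
        # external-memory version would bound the blocks in flight.
        for partial in pool.map(chunk_stats, read_chunks(rows, chunk_size)):
            total = combine(total, partial)
    n, sx, sy, sxx, sxy = total
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope
```

Because the partial results are small and simply add together, the answer does not depend on the chunk size, and the memory needed per worker is set by the block size rather than by the size of the data set.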

6. Run Revolution R Enterprise on Hadoop Edge Node(s)
Fast Single-Server Alternative for Modest Data Scale
- (opt.) Thin Client or Remote Desktop
- Edge Node: ScaleR + CRAN Algorithms; RODBC, rhdfs, rhbase, RHive; Local File System
- As With Previous: Single Machine Execution; PEMA Scale & Speed (Single Machine); Use ScaleR + CRAN; Accelerate R with Intel MKL
- Improvements: Easily Shared via ...; No Data Movement; Develop on Desktop, Run on Edge Node
- Limitations: Shorter Trip for Data

7. Fast, Transparent Parallel Computation Inside Hadoop YARN/MapReduce
Fast Parallelized Analytics on Large Data Sets In Hadoop
- Components: Desktop & Server Tools and Applications; Web Services; DeployR; Remote Execution; JobTracker; ScaleR Algorithms
- As With Previous: Speed and Scale of ScaleR PEMA Algorithms; Use CRAN Where Appropriate; Accelerate R Math with MKL; Custom Parallelized Algos
- Advantages: Parallel Computation; No Data Movement; ScaleR PEMA Parallelization; Can Parallelize CRAN Carefully; Portable Coding
- Limitations: Hadoop Workload Profiles

One Client's Experience with RRE on Hadoop

Test Cluster - 9 Nodes
- 1 Edge Node: 128GB, 24 cores each
- 2 Admin Nodes: 128GB, 24 cores each
- 9 Task Nodes: 64GB, 24 cores each

Task: Processing Time
Importing and Filtering Datasets from HDFS
- 14 Million Observations: 82 sec.
- 227 Million Observations: 310 sec.
Modeling and Estimation
- 1.2 M Correlations: 2771 sec.
- Simple Linear Regression, 227 M Observations: 61 sec.
- Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec.
- Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec.
- Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
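The timings on this slide are easier to compare as throughput. The short calculation below divides the stated observation counts by the stated times; it uses only figures from the slide, reads "M" as millions (an assumption), and the helper names are hypothetical.

```python
# (task, observations, seconds) copied from the client timings on the slide
TIMINGS = [
    ("Import and filter, 14 M observations", 14_000_000, 82),
    ("Import and filter, 227 M observations", 227_000_000, 310),
    ("Simple linear regression, 227 M observations", 227_000_000, 61),
    ("Multiple linear regression, three variables", 227_000_000, 58),
]


def throughput(observations, seconds):
    """Observations processed per second of wall-clock time."""
    return observations / seconds


def report(timings=TIMINGS):
    """Map each task name to its observations-per-second rate."""
    return {task: throughput(obs, sec) for task, obs, sec in timings}
```

On these figures the larger import runs at roughly 0.73 million observations per second against roughly 0.17 million for the smaller one, which suggests a fixed cost per job that matters less as the data grows. That reading is an inference from two data points, not something the slide states.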

8. Combined Edge Node & In-Hadoop
Maximized Flexibility, Performance & Workload Handling
- Components: Thin Client Development; Remote Execution; ScaleR Algorithms; Desktop & Server Tools and Applications; Web Services; RStudio; DeployR
- As With Previous: Speed and Scale of ScaleR PEMA Algorithms; Use CRAN Where Appropriate; Accelerate R Math with MKL; Custom Parallelized Algos
- Advantages: Flexibility for Blended Workloads; Little or No Data Movement; Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes

Occasionally Conflicting Criteria
Infrastructure Criteria: Big Data Platform Vendor Choice Data Ingest Data Security Data Governance
Data Science Criteria: Performance Self Service Flexibility Collaboration Sharing Capability

Key Questions:
- Where is the bulk of your skills? SAS? R? Java? Python? SQL?
- Where do you build models today?
- Do you have the skills to parallelize algorithms?
- Can models be built on a big shared server?
- How will you run models?
- Do you have the budget to purchase commercial solutions?
- How will your needs change over time?
- What is your future architecture plan?
- How risk averse is your management team regarding new platforms and open source?

Key Questions (cont.)
What Workloads Do You Anticipate?
- How Many Users?
- What Workloads?
Workload Realities:
- Many small tasks do not run well in MapReduce
- Large data movements / duplications are costly
What Use Cases Will You Encounter?
- Traditional statistical exploration, modeling?
- Behavior Prediction?
- Outlier Detection?
- Simulation and HPC?
- Massively wide data?
- Real-Time scoring?
- Internet of Things?
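The point that many small tasks do not run well in MapReduce can be made concrete with a toy cost model: if every job pays a fixed startup cost, submitting each small task as its own job multiplies that cost. The model and every parameter value used with it are hypothetical illustrations, not measurements from the talk, and the model ignores parallelism entirely.

```python
def batch_time(n_tasks, work_per_task_s, job_overhead_s, tasks_per_job):
    """Total time when tasks are grouped into jobs with a fixed startup cost.

    Hypothetical sequential model: each job pays job_overhead_s once,
    and each task adds work_per_task_s of actual computation.
    """
    n_jobs = -(-n_tasks // tasks_per_job)  # ceiling division
    return n_jobs * job_overhead_s + n_tasks * work_per_task_s


def overhead_share(n_tasks, work_per_task_s, job_overhead_s, tasks_per_job):
    """Fraction of the total time spent on job startup rather than work."""
    total = batch_time(n_tasks, work_per_task_s, job_overhead_s, tasks_per_job)
    return (total - n_tasks * work_per_task_s) / total
```

With an assumed 20-second startup and half a second of work per task, 1,000 tasks run as 1,000 separate jobs spend almost all of their time on startup, while the same tasks batched into one job spend only a small fraction of it there.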

Eight Steps to Fast, Scalable R Analytics with Hadoop
Open Source Options
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization
4. RHadoop
Commercial Options
5. RRE on Servers & Workstations
6. RRE on Edge Nodes
7. RRE Inside Hadoop
8. RRE on Edge Node & Inside Hadoop
No Clear Winner:
- Budget & use case determine optimal path
- Compelling options in both open source & commercial source
- RRE ScaleR uniquely provides automatic parallelization
- Current Hadoop platforms are fast for large scale analytics
- Combined in-server & in-Hadoop fits majority of cases

2015 Challenges & Opportunities
Evolving Hadoop Architectures
- In-Memory Analytics: Spark, YARN Containers, Caching
- Additional Algorithm Parallelization
- Cluster Management
- Cloud and Hybrid Cloud Clusters
- SQL on Hadoop Battle-Royale
Addressing the Resource Reality
- Integration, Deployment Both Drain on Expensive Resources
- Leverage other skills
- Design efficient collaboration
Analytics for the Rest of Us
- New Consumption Targets: Mobile
- New Participants in Design: Business Users

Recommended Resources
Revolution Analytics Products:
http://www.revolutionanalytics.com/products
http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
Whitepaper: Delivering Value from Big Data with Revolution R Enterprise and Hadoop
http://www.revolutionanalytics.com/whitepaper/delivering-value-big-datarevolution-r-enterprise-and-hadoop
Revolution Analytics on Social Media:
http://blog.revolutionanalytics.com/
@revolutionr on Twitter
@bill_jacobs on Twitter

Thank you. www.revolutionanalytics.com 1.855.GET.REVO Twitter: @RevolutionR