Revolution R Enterprise: Efficient Predictive Analytics for Big Data




Revolution R Enterprise: Efficient Predictive Analytics for Big Data
Prepared for The Bloor Group, August 2014
Bill Jacobs, Director Product Marketing / Field CTO - Big Data Products
bill.jacobs@revolutionanalytics.com, @bill_jacobs

Revolution R Enterprise
- Why R?
- Who is Revolution Analytics?

What is R?
- Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
- Most powerful statistical programming language: flexible, extensible and comprehensive for productivity
- Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data
- Thriving open-source community: leading edge of analytics research
- Fills the talent gap: new graduates prefer R
White paper: "R is Hot", bit.ly/r-is-hot

Graphics and Data Visualization
- Functions for standard graphs: scatterplot, time series, histogram, smoothing, bar plot, pie chart, dot chart, image plot, 3-D surface, map
- Influences from Cleveland, Tufte etc.: conditioning, small multiples, use of color
- Customize without limits: combine graph types, create entirely new graphics

Breadth of R Community Collaboration
CRAN Task Views (subset of 5000+ packages)
Source: http://www.maths.lancs.ac.uk/~rowlings/r/taskviews/

Exploding growth and demand for R
R Usage Growth (Rexer Data Miner Survey, 2007-2013; source: www.rexeranalytics.com)
- 70% of data miners report using R
- R is the first choice of more data miners than any other software
Other indicators:
- R is the highest paid IT skill (Dice.com, Jan 2014)
- R is the most-used data science language after SQL (O'Reilly, Jan 2014)
- R is used by 70% of data miners (Rexer, Sep 2013)
- R is #15 of all programming languages (RedMonk, Jan 2014)
- R is growing faster than any other data science language (KDnuggets, Aug 2013)
- More than 2 million users worldwide

Our Company: The leading provider of advanced analytics software and services based on open source R, since 2007. Offices: Mountain View, London, Singapore.
Our Software: The only Big Data, Big Analytics software platform based on the data science language R.
Some Kudos: Visionary, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014.

Technical Support for Open Source R: AdviseR from Revolution Analytics
Technical support for open source R, from the R experts.
- 24x7 email and phone support
- On-line case management and knowledgebase
- Access to technical resources, documentation and user forums
- Exclusive on-line webinars from community experts
- Guaranteed response times
R Support: 12 incidents, 12 months, $3000 per user
www.revolutionanalytics.com/adviser

Big Data Big Analytics is different

R: Open Source that Drives Innovation, but It Has Some Limitations for Enterprises
- Big Data. Open source R: in-memory bound. Revolution R Enterprise: hybrid memory & disk scalability. Result: operates on bigger volumes & factors.
- Speed of Analysis. Open source R: single threaded. Revolution R Enterprise: parallel threading. Result: shrinks analysis time.
- Enterprise Readiness. Open source R: community support. Revolution R Enterprise: commercial support. Result: delivers full service production support.
- Analytic Breadth & Depth. Open source R: 5000+ innovative analytic packages. Revolution R Enterprise: leverage open source packages plus Big Data ready packages. Result: supercharges R.
- Commercial Viability. Open source R: risk of deployment of open source. Revolution R Enterprise: commercial license. Result: eliminate risk with open source.

The Goal: Desktop Users with Analytical Access to Huge Data in Big Data Platforms
(Diagram: predictive modeling algorithms operating on big data.)

Goal #1: Simplicity. One Call Does It All

rxSetComputeContext(RxHadoopMR(...))
arrDelayLm1 <- rxLinMod(...)

In Revolution R Enterprise, the call:
1. Starts a master process
2. Distributes work, taking into account:
   - Threading? Cores? Sockets? Nodes?
   - Available RAM?
   - Location of data?
3. Creates master tasks for each node
4. Has the master initiate distributed work:
   - Hadoop schedules a mapper for each split
   - The algorithm computes an intermediate result
   - A reducer combines the intermediate results
5. Has the master process evaluate completion
6. Returns the consolidated answer to the script
(Diagram: an initiator, per-node tasks and a finalizer operating on big data.)
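The map, combine and finalize flow in steps 4 to 6 can be illustrated with a small sketch. This is not RevoScaleR code and says nothing about how rxLinMod is implemented; it is a plain-Python illustration, assuming a one-predictor linear regression, of how each split can be reduced to a few sums that a reducer adds together before the master solves for the coefficients. The names `map_split`, `combine`, `finalize` and `distributed_linmod` are hypothetical.

```python
from functools import reduce


def map_split(rows):
    """Mapper: reduce one data split to its sufficient statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reducer: intermediate results combine by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def finalize(stats):
    """Master: solve the least-squares equations from the combined sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def distributed_linmod(splits):
    # In a real cluster each map_split call would run next to its data;
    # here the splits are processed one after another for illustration.
    partials = [map_split(split) for split in splits]
    return finalize(reduce(combine, partials))


# Three "splits" of points that lie exactly on y = 2x + 1.
splits = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
intercept, slope = distributed_linmod(splits)
print(intercept, slope)
```

Because only five numbers per split travel to the reducer, the amount of data moved does not grow with the number of rows in a split.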

Goal #2: Scale
Analyze Data That Far Exceeds Available Memory
(Diagram: predictive modeling on big data.)
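One common way to analyze data larger than memory is to read it in chunks and keep only a small running summary. The sketch below is an illustration of that idea in plain Python, not a description of ScaleR internals: it computes a mean and sample variance chunk by chunk, merging per-chunk summaries with the standard pairwise update for counts, means and sums of squared deviations. `read_chunks` is a hypothetical stand-in for reading a large file block by block.

```python
def chunk_stats(chunk):
    """Summarize one in-memory chunk as (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((v - mean) ** 2 for v in chunk)
    return n, mean, m2


def merge(a, b):
    """Merge two summaries without revisiting the underlying data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def read_chunks(values, chunk_size):
    """Hypothetical stand-in for reading a large file block by block."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def streaming_mean_variance(chunks):
    total = None
    for chunk in chunks:
        if not chunk:
            continue
        stats = chunk_stats(chunk)
        total = stats if total is None else merge(total, stats)
    if total is None:
        raise ValueError("no data")
    n, mean, m2 = total
    return mean, m2 / (n - 1)  # sample variance; needs at least two values


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance = streaming_mean_variance(read_chunks(data, chunk_size=3))
print(mean, variance)
```

Only one chunk and one three-number summary are held in memory at a time, so the memory needed depends on the chunk size rather than on the size of the data set.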

Goal #3: Manageability
Big Data Means One Repository with Many Users
(Diagram: predictive modeling alongside other workloads on YARN, Teradata.)

Goal #5: Stability
Invest in R with Confidence
(Diagram: version timeline 1.0, 2.0, 3.0?)

Goal #6: Reach
Deploy Analytics to Enterprise Audiences: Information Users, Business Analysts, IT Pros

Goal #7: Portability
Write Once, Deploy Anywhere (well, almost anywhere)

Revolution R Enterprise What is in Revolution R Enterprise? How does it perform?

Introducing Revolution R Enterprise (RRE): The Big Data Big Analytics Platform
- Big Data Big Analytics ready
- Enterprise readiness
- High performance analytics
- Multi-platform architecture
- Data source integration
- Development tools
- Deployment tools
Components: R+CRAN, RevoR, DevelopR, DeployR, ConnectR, ScaleR, DistributedR

The Platform Step by Step: R Capabilities

R+CRAN
- Open source R interpreter (UPDATED: R 3.0.2)
- Freely-available R algorithms
- Algorithms callable by RevoR
- Embeddable in R scripts
- 100% compatible with existing R scripts, functions and packages

RevoR
- Performance enhanced R interpreter
- Based on open source R
- Adds high-performance math

Available on:
- IBM Platform LSF Linux
- Microsoft HPC Clusters
- Windows & Linux Servers
- Windows & Linux Workstations
- IBM Netezza
- Cloudera Hadoop (NEW)
- Hortonworks Hadoop (NEW)
- Teradata Database (NEW)
- Intel Hadoop
- IBM BigInsights

Revolution R Enterprise: RevoR - Performance Enhanced R
Customers report 3-50x performance improvements compared to Open Source R without changing any code.

Computation (4-core laptop)        Open Source R   Revolution R   Speedup
Linear Algebra [1]
  Matrix Multiply                  176 sec         9.3 sec        18x
  Cholesky Factorization           25.5 sec        1.3 sec        19x
  Linear Discriminant Analysis     189 sec         74 sec         3x
General R Benchmarks [2]
  R Benchmarks (Matrix Functions)  22 sec          3.5 sec        5x
  R Benchmarks (Program Control)   5.6 sec         5.4 sec        Not appreciable

[1] http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
[2] http://r.research.att.com/benchmarks/

The Platform Step by Step: Parallelization & Data Sourcing

ScaleR: Ready-to-use high-performance big data big analytics
- Fully-parallelized analytics
- Data prep & data distillation
- Descriptive statistics & statistical tests
- Correlation & covariance matrices
- Predictive models: linear, logistic, GLM
- Machine learning
- Monte Carlo simulation
- Tools for distributing customized algorithms across nodes

ConnectR: High-speed & direct connectors, available for:
- High-performance XDF
- SAS, SPSS, delimited & fixed format text data files
- Hadoop HDFS (text & XDF)
- Teradata Database & Aster
- EDWs and ADWs
- ODBC

DistributedR: Distributed computing framework that delivers portability across platforms, available on:
- Windows Servers
- Red Hat Linux
- SuSE Linux Servers
- IBM Platform LSF Linux
- Microsoft HPC Clusters
- Teradata Database
- Cloudera Hadoop
- Hortonworks Hadoop
- MapR Hadoop (NEW)

Revolution R Enterprise ScaleR: High Performance Big Data Analytics
R Data Step, Descriptive Statistics, Statistical Tests, Sampling, Predictive Modeling, Data Visualization, Machine Learning, Simulation

Revolution R Enterprise 7.0 ScaleR Parallelized Algorithms

Data Step
- Data import: Delimited, Fixed, SAS, SPSS, ODBC
- Variable creation & transformation
- Recode variables
- Factor variables
- Missing value handling
- Sort, Merge, Split
- Aggregate by category (means, sums)

Descriptive Statistics
- Min / Max, Mean, Median (approx.)
- Quantiles (approx.)
- Standard Deviation
- Variance
- Correlation
- Covariance
- Sum of Squares (cross product matrix for set variables)
- Pairwise Cross tabs
- Risk Ratio & Odds Ratio
- Cross-Tabulation of Data (standard tables & long form)
- Marginal Summaries of Cross Tabulations

Statistical Tests
- Chi Square Test
- Kendall Rank Correlation
- Fisher's Exact Test
- Student's t-test

Sampling
- Subsample (observations & variables)
- Random Sampling

Predictive Models
- Sum of Squares (cross product matrix for set variables)
- Multiple Linear Regression
- Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
- Covariance & Correlation Matrices
- Logistic Regression
- Classification & Regression Trees
- Predictions/scoring for models
- Residuals for all models

Variable Selection
- Stepwise Regression: Linear, Logistic and GLM

Simulation
- Monte Carlo
- Parallel Random Number Generation

Cluster Analysis
- K-Means

Classification
- Decision Trees
- Decision Forests

Combination
- Using Revolution rxDataStep and rxExec functions to combine open source R with Revolution R
- Works with Open Source R v3

Revolution R Enterprise ScaleR Outperforms SAS HPA at a Fraction of the Cost

Logistic Regression:

                 SAS HPA*        Relative     Revolution R
Rows of data     1 billion                    1 billion
Parameters       "just a few"    Double       7
Time             80 seconds      45%          44 seconds
Data location    In memory                    On disk
Nodes            32              1/6th        5
Cores            384             5%           20
RAM              1,536 GB        5%           80 GB

Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
Bottom Line: Revolution R Enterprise Performance = Greatly Reduced TCO
*As published by SAS in HPC Wire, April 21, 2011
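The "20th", "6th" and "45%" figures on this slide follow directly from the raw numbers. The short calculation below reproduces them; the dictionary keys are illustrative names, and the values are taken from the slide.

```python
# Raw figures from the slide: the published SAS run versus the Revolution R run.
sas = {"seconds": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
rre = {"seconds": 44, "nodes": 5, "cores": 20, "ram_gb": 80}

# Revolution R resources as a fraction of the SAS resources.
ratios = {key: rre[key] / sas[key] for key in sas}

# Fraction of run time saved: 1 - 44/80.
time_saved = 1 - ratios["seconds"]

for key, value in ratios.items():
    print(f"{key}: {value:.1%} of the SAS figure")
print(f"time saved: {time_saved:.0%}")
```

Cores and RAM both come out at roughly 5.2% (about a 20th), nodes at roughly 15.6% (about a 6th), and the run time is 45% shorter.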

DistributedR: Platform Abstraction for ScaleR
- DistributedR Distributed Files: Cloudera Hadoop, Hortonworks Hadoop, MapR Hadoop. Under development: Spark
- DistributedR MPI Based: IBM Platform LSF Linux, Microsoft HPC Clusters
- DistributedR Local Files: Red Hat Linux, SuSE Linux Servers, Windows Server
- DistributedR SQL: Teradata Database
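The idea of a platform abstraction can be sketched as an algorithm written once against a small "compute context" interface, with interchangeable back ends deciding where the per-chunk work runs. The sketch below is only an analogy in plain Python; `LocalContext`, `ThreadPoolContext` and `distributed_mean` are hypothetical names and do not correspond to the DistributedR API.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalContext:
    """Runs every chunk task sequentially in the calling process."""

    def run(self, task, chunks):
        return [task(chunk) for chunk in chunks]


class ThreadPoolContext:
    """Runs chunk tasks on a pool of worker threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, task, chunks):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(task, chunks))


def chunk_sum_and_count(chunk):
    """Per-chunk work: the only part that touches the raw data."""
    return sum(chunk), len(chunk)


def distributed_mean(context, chunks):
    """The algorithm itself never mentions where the work runs."""
    partials = context.run(chunk_sum_and_count, chunks)
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
local_result = distributed_mean(LocalContext(), chunks)
pooled_result = distributed_mean(ThreadPoolContext(workers=2), chunks)
print(local_result, pooled_result)
```

Switching back ends changes only the context object passed in, which mirrors the slide's point that the same analytic code can target different platforms.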

Transparent Scaling of R Analytics on Hadoop
- Run RRE analytics in Hadoop without change
- Eliminate the need to design parallel software or think in MapReduce
- Leverage all Revolution R Enterprise pre-parallelized algorithms
- Enable users to build custom apps that leverage Hadoop's parallelism
- Slash data movement by analyzing HDFS data in place
- Expand deployment and integration options
(Diagram labels: Revolution R Enterprise desktops & servers; corporate applications; web services; Hadoop cluster with edge or worker node and data nodes.)

Simplicity Goal: Leverage Teradata Database As A Parallel R Engine
- Run RRE code without change with Teradata as engine
- Eliminate the need to design parallel software or write UDFs
- Leverage all Revolution R Enterprise pre-parallelized algorithms
- Enable users to build custom apps that leverage Teradata MPP
- Slash data movement by analyzing data inside the database
- Expand deployment and integration options

Support Multiplatform Deployments
- Data Repository: Refining, Exploration & Profiling
- Development Environment: Model Development & Testing
- Operations: Application Deployment

Write Once. Deploy Anywhere.
- Hadoop: Hortonworks, Cloudera
- EDW: IBM Netezza, Teradata
- Clustered Systems: IBM Platform LSF, Microsoft HPC
- Workstations & Servers: Desktop, Server
- In the Cloud: Amazon AWS
Designed for scale, portability & performance

Goal #6: Reach
Enterprise Integration & Deployment: Information Users, Business Analysts, IT Pros

The Platform Step by Step: Tools & Deployment

DevelopR
- Integrated development environment for R
- Visual step-into debugger
- Available on: Windows

DeployR
- Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
- Integrates R into application infrastructures
- Capabilities: invokes R scripts from web services calls; RESTful interface for easy integration; works with web & mobile apps, leading BI & visualization tools and business rules engines

DevelopR Integrated Development Environment
- Script with type ahead and code snippets
- Solutions window for organizing code and data
- Sophisticated debugging with breakpoints, variable values etc.
- Objects loaded in the R environment
- Packages installed and loaded
- Object details
http://www.revolutionanalytics.com/demos/revolution-productivityenvironment/demo.htm

RRE DeployR: Integration with 3rd Party Software
- Seamless: Bring the power of R to any web enabled application
- Simple: Leverage common APIs including JS, Java, .NET
- Scalable: Robustly scale user and compute workloads
- Secure: Manage enterprise security with LDAP & SSO
(Diagram labels: R / statistical modeling expert; DeployR; deployment expert; data analysis; business intelligence; mobile; web apps; cloud / SaaS.)

The Power of Revolution R Enterprise: Performance & Scalability
Layers in order of increasing value, bottom of the stack listed last:
- In-Hadoop Execution (ScaleR): moves computation to data
- In-Database Execution (ScaleR): moves computation to data
- Parallelized User Code (ScaleR): leverage CRAN
- Parallelized Algorithms (ScaleR): labor saving power
- Grid Processing (DistributedR): maximizes computation
- Multi-Threaded Execution (DistributedR): powerful divide & conquer
- Memory Management (DistributedR): effective memory utilization
- Fast Math Libraries (RevoR): 3-50X faster
- R + CRAN (Open Source): leverage latest innovation

Revolution R Enterprise Pre-Parallelized Algorithms vs. Manual Approaches

Parallelizing Algorithms By Hand: Illustrated Using K-Means Clustering
- The Goal: Compute the algorithm on distributed data.
- The Fun Part: Re-factor the algorithm for distributed compute. Distributed K-Means requires about 6 steps.
- The Tedious Part: Then knit everything together.
Counting all the steps, efficient distributed computation of the K-Means clustering algorithm requires approximately 13 steps: 1 initiator, 6 steps, 5 combiners and 1 finalizer. And the result runs on a single platform.

Parallelizing Algorithms By Hand: Illustrated Using K-Means Clustering
The Goal: Compute the K-Means Clustering Algorithm

Manual Parallelization (RHadoop, iotools, Hadoop Streaming):
1) Refactor algorithm math for parallelism
2) Write 13 components
3) Build the algorithm
4) Test and debug
5) Document
6) Optimize
7) Use

- Or -

Automatic Parallelization with Revolution R Enterprise:
- Use the algorithm from day one
- Plus custom algorithms
- Plus multiple platforms
- The result runs on all ScaleR platforms
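To give a feel for what refactoring K-Means for distributed data involves, here is a minimal plain-Python sketch. It is not the 13-component design counted on the slide and not ScaleR code; it assumes the simplest decomposition, in which each chunk reports per-cluster coordinate sums and counts, a combiner adds them, and the centroids are recomputed until they stop changing. All function names are hypothetical.

```python
from functools import reduce


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def map_chunk(chunk, centroids):
    """Per-chunk step: accumulate coordinate sums and counts per cluster."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dims):
            sums[k][d] += point[d]
    return sums, counts


def combine(a, b):
    """Combiner: partial sums and counts add element-wise."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def update(centroids, sums, counts):
    """New centroid = mean of assigned points; empty clusters keep their old centroid."""
    return [
        tuple(s / n for s in total) if n else old
        for old, total, n in zip(centroids, sums, counts)
    ]


def distributed_kmeans(chunks, centroids, max_iter=20):
    for _ in range(max_iter):
        partials = [map_chunk(chunk, centroids) for chunk in chunks]
        sums, counts = reduce(combine, partials)
        new_centroids = update(centroids, sums, counts)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Two well-separated groups of points, spread across three chunks.
chunks = [
    [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0)],
    [(12.0, 12.0), (2.0, 0.0), (10.0, 12.0)],
    [(2.0, 2.0), (12.0, 10.0)],
]
result = distributed_kmeans(chunks, [(0.0, 0.0), (12.0, 12.0)])
print(result)
```

Even this simplified version needs separate map, combine, update and driver pieces, plus initialization and convergence handling, which is the kind of plumbing the slide contrasts with calling a pre-parallelized algorithm.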

Revolution R Enterprise and Big Data Platforms
- How Does RRE Play Inside Hadoop?
- How Does RRE Play Inside the Teradata Database?

Parallel Algorithms Running in Hadoop: Transparent Distribution of Computation
(Diagram labels: Revolution R Enterprise desktops & servers; Hadoop cluster; edge node with master process; data nodes with mappers and a reducer.)

How Does It Work? RRE in Teradata Database
(Diagram labels: Revolution R Enterprise desktops & servers; ODBC; Teradata Database; parse engine; external stored procedure; database nodes; AMPs, each with a table operator; hybrid storage; data segments.)

Revolution Analytics Support Offerings Support and Training Products Quick Start Programs

Customer Support
Customer Support is included with Subscription

Standard Support
- Today: 9-5pm PT by telephone, email and web; also via a support forum
- Q4 '13: Standard 9-6pm regional hours

Premium Support
- Today: Premium Support as Blocks; On-Demand Consulting
- Q1 '14: 24x7x365 support for Severity 1 issues
- Q1 '14: Technical Account Management

Revolution Analytics Professional Services
- Training: comprehensive topics; self paced & classroom; customizable
- Consulting: remote & on site; projects & staff augmentation
- Quick Start Programs: entire project lifecycle support; bundle them together

Training: Onsite, Online, Custom

Quick Start Options

Revolution R Enterprise Ecosystem: Power of Integration
(Diagram of partner categories: SI / Service; Deployment / Consumption; MSP / DSP; Advanced Analytics; ETL; Data / Infrastructure. Legible partner name: Corios.)

Big Data Analytics Applications Our Customers and Prospects Are Attacking
- Marketing: Clickstream & Campaign Analyses
- Commerce: Recommendation Engines
- Social Media: Sentiment Analysis
- Behavior Prediction: Purchase Prediction
- Outlier Detection: Fraud, Waste and Abuse
- Risk Analysis: Insurance Underwriting
- Machine Telemetry: Predictive Maintenance
- Operations: Automotive Warranty Optimization
- Econometrics: Market Prediction
- Product Marketing: Mix and Price Optimization
- Life Sciences: Treatment Outcome Prediction
- Drug Development: Pharmacogenetics
- Transportation: Asset Utilization

Revolution Analytics - Overview
We are the only provider of a commercial analytics platform based on the open source R statistical computing language.
- Enterprise Readiness: stable, scalable multi-platform with world-wide support
- Power: distributed, high performance analytical algorithms
- Productivity: easier to build and deploy analytic applications; professional services enablement
World Wide Support Teams
- Standard and Premium Programs
- Technical Account Managers
- Customer Success Managers
Professional Services
- Architecture planning
- Systems integration
- Advanced analytic applications
- Full life cycle projects

Why Customers Buy Revolution R Enterprise
- Powers R for the Enterprise
- Supercharge R with Big Data Scale
- Enabling Analytics Across the Big Data Platforms
- Lower Cost & Risk of Big Analytics