Revolution R Enterprise: Efficient Predictive Analytics for Big Data Prepared for The Bloor Group August 2014 Bill Jacobs Director Product Marketing / Field CTO - Big Data Products bill.jacobs@revolutionanalytics.com @bill_jacobs
Revolution R Enterprise Why R? Who[m] is Revolution Analytics?
What is R? Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts. Most powerful statistical programming language: flexible, extensible and comprehensive for productivity. Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data. Thriving open-source community: leading edge of analytics research. Fills the talent gap: new graduates prefer R. White paper: R is Hot, bit.ly/r-is-hot
Graphics and Data Visualization. Functions for standard graphs: scatterplot, time series, histogram, smoothing, bar plot, pie chart, dot chart, image plot, 3-D surface, map. Influences from Cleveland, Tufte etc. Conditioning, small multiples, use of color. Customize without limits. Combine graph types. Create entirely new graphics.
Breadth of R Community Collaboration CRAN Task Views (subset of 5000+ packages) Source: http://www.maths.lancs.ac.uk/~rowlings/r/taskviews/
Exploding growth and demand for R. R Usage Growth (Rexer Data Miner Survey, 2007-2013): 70% of data miners report using R; R is the first choice of more data miners than any other software. Source: www.rexeranalytics.com. R is the highest paid IT skill (Dice.com, Jan 2014). R most-used data science language after SQL (O'Reilly, Jan 2014). R is used by 70% of data miners (Rexer, Sep 2013). R is #15 of all programming languages (RedMonk, Jan 2014). R growing faster than any other data science language (KDnuggets, Aug 2013). More than 2 million users worldwide.
OUR COMPANY (MOUNTAIN VIEW, LONDON, SINGAPORE): The leading provider of advanced analytics software and services based on open source R, since 2007. OUR SOFTWARE: The only Big Data, Big Analytics software platform based on the data science language R. SOME KUDOS: Visionary, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014.
Technical Support for Open Source R: AdviseR from Revolution Analytics. Technical support for open source R, from the R experts. 24x7 email and phone support. On-line case management and knowledgebase. Access to technical resources, documentation and user forums. Exclusive on-line webinars from community experts. Guaranteed response times. R SUPPORT: 12 INCIDENTS, 12 MONTHS, $3000 PER USER. www.revolutionanalytics.com/adviser
Big Data Big Analytics is different
R: Open Source that Drives Innovation, but It Has Some Limitations for Enterprises
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source
The Goal: Desktop Users with Analytical Access to Huge Data in Big Data Platforms Predictive Modeling Algorithms Big Data
Goal #1: Simplicity. One Call Does It All
rxSetComputeContext(RxHadoopMR( ))
arrDelayLm1 <- rxLinMod( )
In Revolution R Enterprise:
1. Starts a Master Process
2. Distributes Work
   1. Threading? Cores? Sockets? Nodes?
   2. Available RAM?
   3. Location of Data?
3. Master Tasks for Each Node
4. Master Initiates Distributed Work
   1. Hadoop Schedules Mapper for Each Split
   2. Algorithm Computes Intermediate Result
   3. Reducer Combines Intermediate Results
5. Master Process Evaluates Completion
6. Returns Consolidated Answer to Script
Diagram: Initiator, Task, Task, Task, Finalizer, Big Data
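The mapper, reducer and master roles in step 4 can be illustrated with a minimal plain-Python sketch. This is not how ScaleR is implemented and it uses none of its API; it assumes a one-variable linear regression whose intermediate results are simple sums, and every function name below is hypothetical.

```python
from functools import reduce

def map_split(rows):
    """Mapper (hypothetical): intermediate sums for one split of (x, y) pairs."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reducer: the intermediate results are plain sums, so they add element-wise."""
    return tuple(u + v for u, v in zip(a, b))

def finalize(stats):
    """Master: turn the combined sums into a slope and an intercept."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def fit_linear(splits):
    """One call: map every split, reduce the partial results, finalize."""
    return finalize(reduce(combine, (map_split(s) for s in splits)))

# Three "splits" of data that follow y = 2x + 1
splits = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
slope, intercept = fit_linear(splits)
print(slope, intercept)
```

Because each split only contributes sums, the splits can be processed in any order or on any node, which is what lets a scheduler place one mapper per split.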
Goal #2: Scale. Analyze Data That Far Exceeds Available Memory. Predictive Modeling, Big Data
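One common way to analyze data larger than memory is to stream it in chunks and keep only a small running summary. The sketch below shows the idea for mean and variance in plain Python; it is an illustration under stated assumptions, not ScaleR code. `read_chunks` is a hypothetical stand-in for a reader that pulls one block at a time from disk, and the merge step is the standard pairwise update for count, mean and sum of squared deviations.

```python
def read_chunks(values, size):
    """Hypothetical stand-in for a disk reader: yields one block at a time."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

def chunk_stats(chunk):
    """Summary of one chunk: count, mean, sum of squared deviations."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((v - mean) ** 2 for v in chunk)
    return n, mean, m2

def merge(a, b):
    """Combine two summaries without revisiting the underlying data."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

def streaming_mean_var(chunks):
    """Mean and sample variance over all chunks, holding one chunk at a time."""
    total = None
    for chunk in chunks:
        if not chunk:
            continue
        stats = chunk_stats(chunk)
        total = stats if total is None else merge(total, stats)
    if total is None or total[0] < 2:
        raise ValueError("need at least two observations")
    n, mean, m2 = total
    return mean, m2 / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean, var = streaming_mean_var(read_chunks(data, 3))
print(mean, var)
```

Memory use depends on the chunk size, not on the total number of rows.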
Goal #3: Manageability: Big Data Means One Repository with Many Users Other Workloads Predictive Modeling YARN, Teradata Other Workloads
Goal #5: Stability: Invest in R with Confidence 1.0 2.0 3.0?
Goal #6: Reach: Deploy Analytics to Enterprise Audiences. Information Users, Business Analysts, IT Pros
Goal #7: Portability. Write Once, Deploy Anywhere (well, almost anywhere)
Revolution R Enterprise What is in Revolution R Enterprise? How does it perform?
Introducing Revolution R Enterprise (RRE) The Big Data Big Analytics Platform Big Data Big Analytics Ready DevelopR DeployR Enterprise readiness High performance analytics R+CRAN RevoR ConnectR ScaleR DistributedR Multi-platform architecture Data source integration Development tools Deployment tools
The Platform Step by Step: R Capabilities
R+CRAN: Open source R interpreter (UPDATED: R 3.0.2). Freely-available R algorithms. Algorithms callable by RevoR. Embeddable in R scripts. 100% compatible with existing R scripts, functions and packages.
RevoR: Performance enhanced R interpreter. Based on open source R. Adds high-performance math.
Available on: Platform LSF Linux, Microsoft HPC Clusters, Windows & Linux Servers, Windows & Linux Workstations, IBM Netezza, Cloudera Hadoop (NEW), Hortonworks Hadoop (NEW), Teradata Database (NEW), Intel Hadoop, IBM BigInsights
Revolution R Enterprise: RevoR - Performance Enhanced R
Customers report 3-50x performance improvements compared to Open Source R without changing any code.
Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra (1)
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks (2)
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
The Platform Step by Step: Parallelization & Data Sourcing
ScaleR: Ready-to-use high-performance big data big analytics. Fully-parallelized analytics: data prep & data distillation; descriptive statistics & statistical tests; correlation & covariance matrices; predictive models (linear, logistic, GLM); machine learning; Monte Carlo simulation. Tools for distributing customized algorithms across nodes.
ConnectR: High-speed & direct connectors. Available for: high-performance XDF; SAS, SPSS, delimited & fixed format text data files; Hadoop HDFS (text & XDF); Teradata Database & Aster; EDWs and ADWs; ODBC.
DistributedR: Distributed computing framework. Delivers portability across platforms. Available on: Windows Servers, Red Hat Linux, SuSE Linux Servers, IBM Platform LSF Linux, Microsoft HPC Clusters, Teradata Database, Cloudera Hadoop, Hortonworks Hadoop, MapR Hadoop (NEW)
Revolution R Enterprise ScaleR: High Performance Big Data Analytics. R Data Step, Descriptive Statistics, Statistical Tests, Sampling, Predictive Modeling, Data Visualization, Machine Learning, Simulation
Revolution R Enterprise 7.0 ScaleR Parallelized Algorithms
Data Step: Data import (Delimited, Fixed, SAS, SPSS, ODBC); Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort, Merge, Split; Aggregate by category (means, sums)
Descriptive Statistics: Min / Max, Mean, Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations
Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher's Exact Test; Student's t-test
Sampling: Subsample (observations & variables); Random Sampling
Predictive Models: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit) and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Classification & Regression Trees; Predictions/scoring for models; Residuals for all models
Variable Selection: Stepwise Regression (Linear, Logistic and GLM)
Simulation: Monte Carlo; Parallel Random Number Generation
Cluster Analysis: K-Means
Classification: Decision Trees; Decision Forests
Combination: Using Revolution rxDataStep and rxExec functions to combine open source R with Revolution R; works with Open Source R v3
Revolution R Enterprise ScaleR Outperforms SAS HPA at a Fraction of the Cost
Logistic Regression | SAS HPA* | Comparison | Revolution R
Rows of data | 1 billion | | 1 billion
Parameters | "just a few" | Double | 7
Time | 80 seconds | 45% | 44 seconds
Data location | In memory | | On disk
Nodes | 32 | 1/6th | 5
Cores | 384 | 5% | 20
RAM | 1,536 GB | 5% | 80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
Bottom Line: Revolution R Enterprise Performance = Greatly Reduced TCO
*As published by SAS in HPC Wire, April 21, 2011
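The comparison figures on this slide can be rechecked from the raw numbers it quotes. The short Python calculation below assumes the reading the slide's own summary sentence gives: 80 seconds, 32 nodes, 384 cores and 1,536 GB belong to the SAS run, and 44 seconds, 5 nodes, 20 cores and 80 GB to the Revolution R run. The dictionary keys are made up for illustration.

```python
# Figures as quoted on the slide; key names are illustrative only
sas = {"seconds": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
rre = {"seconds": 44, "nodes": 5, "cores": 20, "ram_gb": 80}

# Revolution R resource use as a fraction of the SAS configuration
ratios = {key: rre[key] / sas[key] for key in sas}
time_saved = 1 - ratios["seconds"]

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1%} of the SAS figure")
print(f"time saved: {time_saved:.0%}")
```

On these numbers the run time is 55% of the SAS time, so the slide's "45%" is the reduction. Nodes come to about 16% (roughly a 6th), and cores and RAM to about 5% each (roughly a 20th).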
DistributedR: Platform Abstraction for ScaleR
DistributedR Distributed Files: Cloudera Hadoop, Hortonworks Hadoop, MapR Hadoop. Under Development: Spark
DistributedR MPI Based: IBM Platform LSF Linux, Microsoft HPC Clusters
DistributedR Local Files: Red Hat Linux, SuSE Linux Servers, Windows Server
DistributedR SQL: Teradata Database
Transparent Scaling of R Analytics on Hadoop. Run RRE Analytics in Hadoop Without Change. Eliminate Need to Design Parallel Software or Think in MapReduce. Leverage All Revolution R Enterprise Pre-Parallelized Algorithms. Enable Users to Build Custom Apps That Leverage Hadoop's Parallelism. Slash Data Movement by Analyzing HDFS Data in Place. Expand Deployment and Integration Options. Diagram: Revolution R Enterprise Desktops & Servers, direct web services, Corporate Applications, Hadoop Cluster (Edge or Worker Node, Data Nodes)
Simplicity Goal: Leverage Teradata Database as a Parallel R Engine. Run RRE Code Without Change with Teradata as Engine. Eliminate Need to Design Parallel Software or Write UDFs. Leverage All Revolution R Enterprise Pre-Parallelized Algorithms. Enable Users to Build Custom Apps That Leverage Teradata MPP. Slash Data Movement by Analyzing Data Inside the Database. Expand Deployment and Integration Options.
Support Multiplatform Deployments. Data Repository: Refining, Exploration & Profiling. Development Environment: Model Development & Testing. Operations: Application Deployment.
Write Once. Deploy Anywhere. Hadoop: Hortonworks, Cloudera. EDW: IBM Netezza, Teradata. Clustered Systems: IBM Platform LSF, Microsoft HPC. Workstations & Servers: Desktop, Server. In the Cloud: Amazon AWS. Components: R+CRAN, RevoR, DevelopR, ConnectR, ScaleR, DeployR, DistributedR. DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
Goal #6: Reach: Enterprise Integration & Deployment. Information Users, Business Analysts, IT Pros
The Platform Step by Step: Tools & Deployment
DevelopR: Integrated development environment for R. Visual step-into debugger. Available on: Windows.
DeployR: Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs. Integrates R into application infrastructures. Capabilities: invokes R scripts from web services calls; RESTful interface for easy integration; works with web & mobile apps, leading BI & visualization tools and business rules engines.
DevelopR Integrated Development Environment. Script with type ahead and code snippets. Solutions window for organizing code and data. Sophisticated debugging with breakpoints, variable values etc. Objects loaded in the R Environment. Packages installed and loaded. Object details. http://www.revolutionanalytics.com/demos/revolution-productivityenvironment/demo.htm
RRE DeployR: Integration with 3rd Party Software. R / Statistical Modeling Expert, DeployR, Deployment Expert. Data Analysis, Business Intelligence, Mobile, Web Apps, Cloud / SaaS. Seamless: Bring the power of R to any web enabled application. Simple: Leverage common APIs including JS, Java, .NET. Scalable: Robustly scale user and compute workloads. Secure: Manage enterprise security with LDAP & SSO.
The Power of Revolution R Enterprise: Performance & Scalability (layers in order of value)
In-Hadoop Execution | ScaleR | Moves computation to data
In-Database Execution | ScaleR | Moves computation to data
Parallelized User Code | ScaleR | Leverage CRAN
Parallelized Algorithms | ScaleR | Labor saving power
Grid Processing | DistributedR | Maximizes computation
Multi-Threaded Execution | DistributedR | Powerful divide & conquer
Memory Management | DistributedR | Effective memory utilization
Fast Math Libraries | RevoR | 3-50X faster
R + CRAN | Open Source | Leverage latest innovation
Revolution R Enterprise Pre-Parallelized Algorithms vs. Manual Approaches
Parallelizing Algorithms By Hand: Illustrated Using K-Means Clustering. The Goal: Compute the Algorithm on Distributed Data. The Fun Part: Re-Factor the Algorithm for Distributed Compute. Distributed K-Means Requires About 6 Steps. The Tedious Part: Knit Everything Together. Counting All the Steps, Efficient Distributed Computation of the K-Means Clustering Algorithm Requires Approximately 13 Steps: 1 Initiator, 6 Steps, 5 Combiners and 1 Finalizer, and the result runs on a single platform.
Parallelizing Algorithms By Hand: Illustrated Using K-Means Clustering. The Goal: Compute the K-Means Clustering Algorithm. Manual Parallelization (RHadoop, IOTools, Hadoop Streaming): 1) Refactor algorithm math for parallelism 2) Write 13 components 3) Build the algorithm 4) Test and debug 5) Document 6) Optimize 7) Use. - Or - Automatic Parallelization with Revolution R Enterprise: Use the algorithm from day one. + Custom Algorithms. + Multiple Platforms. And the result runs on all ScaleR platforms.
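To give a feel for the manual route, here is a deliberately simplified plain-Python sketch of one way to refactor K-means for partitioned data. It is not the 13-component decomposition shown on the slide and not ScaleR code. It assumes Lloyd's algorithm with a fixed iteration count, and all function names are hypothetical: each partition reports per-cluster sums and counts, a combiner adds them, and a final step recomputes the centroids.

```python
from functools import reduce

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def map_partition(points, centroids):
    """Per-partition step: per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in points:
        k = nearest(pt, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += pt[d]
    return sums, counts

def combine(a, b):
    """Combiner: add the partial sums and counts from two partitions."""
    sums = [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts

def update(centroids, partial):
    """Finalizer: new centroid = sum / count; empty clusters keep the old one."""
    sums, counts = partial
    return [tuple(s / n for s in row) if n else old
            for row, n, old in zip(sums, counts, centroids)]

def kmeans(partitions, centroids, iterations=10):
    for _ in range(iterations):
        partials = [map_partition(p, centroids) for p in partitions]
        centroids = update(centroids, reduce(combine, partials))
    return centroids

partitions = [[(0, 0), (0, 2)], [(10, 10), (10, 12)], [(1, 1), (11, 11)]]
centers = kmeans(partitions, [(0, 0), (10, 10)])
print(centers)
```

Even this toy version leaves out scheduling, data access, convergence checks and failure handling, which is where much of the build, test and optimize effort listed above would go.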
Revolution R Enterprise and Big Data Platforms. How Does RRE Play Inside Hadoop? How Does RRE Play Inside the Teradata Database?
Parallel Algorithms Running in Hadoop: Transparent Distribution of Computation. Diagram: Revolution R Enterprise Desktops & Servers; Hadoop Cluster with Edge Node (Master Process) and Data Nodes (Mapper, Mapper, Mapper, Mapper, Reducer)
How Does It Work? RRE in Teradata Database Revolution R Enterprise Desktops & Servers ODBC Teradata Database Parse Engine External Stored Procedure Database Nodes AMPs Table Operator Table Operator Table Operator Table Operator Hybrid Storage Data Segments
Revolution Analytics Support Offerings Support and Training Products Quick Start Programs
Customer Support: Customer Support is included with Subscription. Standard Support: Today: 9-5pm PT telephone, email, web; also via a support forum. Q4 13: Standard 9-6pm Regional Hours. Premium Support: Today: Premium Support as Blocks, On-Demand Consulting. Q1 14: 24x7x365 Support for Severity 1 Issues. Q1 14: Technical Account Management.
Revolution Analytics Professional Services. Training: Comprehensive Topics, Self Paced & Classroom, Customizable. Consulting: Remote & On site, Projects & Staff Aug. Quick Start Programs: Entire project lifecycle support, Bundle them together.
Training: Onsite, Online, Custom
Quick Start Options
Revolution R Enterprise Ecosystem Power of Integration SI / Service Deployment / Consumption MSP / DSP Advanced Analytics ETL Corios Data / Infrastructure
Big Data Analytics Applications Our Customers and Prospects Are Attacking. Marketing: Clickstream & Campaign Analyses. Commerce: Recommendation Engines. Social Media: Sentiment Analysis. Behavior Prediction: Purchase Prediction. Outlier Detection: Fraud, Waste and Abuse. Risk Analysis: Insurance Underwriting. Machine Telemetry: Predictive Maintenance. Operations: Automotive Warranty Optimization. Econometrics: Market Prediction. Product Marketing: Mix and Price Optimization. Life Sciences: Treatment Outcome Prediction. Drug Development: Pharmacogenetics. Transportation: Asset Utilization.
Revolution Analytics - Overview. We are the only provider of a commercial analytics platform based on the open source R statistical computing language. Enterprise Readiness: Stable, scalable multi-platform with world-wide support. Power: Distributed, high performance analytical algorithms. Productivity: Easier to build and deploy analytic applications. Professional services enablement. World Wide Support Teams: Standard and Premium Programs, Technical Account Managers, Customer Success Managers. Professional Services: Architecture planning, Systems Integration, Advanced analytic applications, Full life cycle projects.
Why Customers Buy Revolution R Enterprise Powers R for the Enterprise Supercharge R with Big Data Scale Enabling Analytics Across the Big Data Platforms Lower Cost & Risk of Big Analytics