R and Hadoop: Architectural Options
Bill Jacobs, VP Product Marketing & Field CTO, Revolution Analytics (@bill_jacobs)

Polling Question #1: Who Are You? (choose one)
- Statistician or modeler who uses R
- Other R developer
- Hadoop expert
- Application builder
- Data guru
- Business user
- Systems vendor or reseller
- Something else

Agenda
- Challenges
- Options
- Considerations
- How to Choose

Boundless Opportunities
- Marketing: Clickstream & Campaign Analyses
- Digital Media: Recommendation Engines
- Retail: Social Sentiment Analysis
- Insurance: Fraud, Waste and Abuse
- Healthcare Delivery: Outcome Prediction
- Manufacturing: Quality Optimization
- P&C Insurance: Risk Analysis
- Consumer Products: Warranty Optimization
- Operations: Supply Chain Optimization
- Econometrics: Market Prediction
- Marketing: Mix and Price Optimization
- Life Sciences: Pharmacogenetics
- Transportation: Asset Utilization

Polling Question #2: What Industry Do You Represent?
- Financial Services
- Insurance
- Healthcare, Life Sciences or Pharma
- Manufacturing
- Energy
- Retail
- Logistics and Transportation
- Education
- Government
- Marketing & Advertising
- Technology
- Other

In A Perfect World Analytical Capability Security Compute Ease Data Scale Price Users
Hadoop Analytics - Many Alternatives
- R-Based Alternatives
- Legacy tools updated: SAS HPA, etc.
- Big Data Databases
- Other Languages: Scala, Java, Julia, various GUIs
Today's Topic: R-Based Alternatives
- Beside Architectures
- Inside Architectures
- Open Source and Commercial

Reality: Tradeoffs.
- Traditional Statistics vs. Machine Learning
- In-Memory vs. Shared Infrastructure
- CRAN vs. Parallelization
- Desktop vs. Remote
- Explicit vs. Automatic Distribution
- Real-Time vs. MapReduce
- Locality vs. Movement
- Memory Limits

No Magic Bullet.
Corporate Overview & Quick Facts
Revolution R Enterprise is the leading commercial analytics platform based on the open source R statistical computing language.
- Founded: 2008 (as REvolution Computing)
- Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London
- CEO: David Rich
- Number of customers: 200+
- Investors: Northbridge Venture Partners, Intel Capital, Platform Vendor
- Web site: www.revolutionanalytics.com

Revolution Analytics
Our Vision: R becomes the de facto standard for enterprise predictive analytics.
Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges.

Revolution Analytics Builds & Delivers:
Software Products:
- Stable Distributions
- Broad Platform Support
- Big Data Analytics in R
- Application Integration
- Deployment Platforms
- Agile Development Tooling
- Future Platform Support
- Open Source Extensions
Support & Services:
- Commercial Support Programs
- Training Programs
- Professional Services
Community Programs:
- Academic Support Programs
- Contributions to Open Source R
- Sponsorship of R User Groups

Revolution Analytics Technical Innovations
- R Options from Open Source Production Deployment to Enterprise Support
- Parallelized Analytical Computation
- In-Database & In-Hadoop Analytics
- Big Data Scalability
- Multi-Platform Deployment
- Legacy Data Format Support
- Multiple IDE Options
- PMML Model Export
- Remote Execution

The Revolution R Product Suite
- Revolution R Open: Free and open source R distribution. Enhanced and distributed by Revolution Analytics.
- Revolution R Plus: Open-source distribution of R, packages, and other components. Enhanced, supported and indemnified by Revolution Analytics.
- Revolution R Enterprise: Secure, scalable and supported distribution of R, with proprietary components created by Revolution Analytics.

Polling Question #3: State of Play: In your company, you are:
- Building Our Data Lake
- Running R + Hadoop Data Today
- Running R inside Hadoop using Open Source
- Running RRE inside Hadoop
- Deploying Business Apps Using Analytics from Hadoop Data
- Looking at Next Steps, e.g. Spark, etc.

Revolution Analytics: Eight Alternatives for Integrating R & Hadoop
Open Source:
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization on Workstations & Servers
4. rhadoop: Open Source Parallelization with rhadoop
Commercial:
5. Revolution R Enterprise on Servers & Workstations
6. Revolution R Enterprise on Edge Nodes
7. Revolution R Enterprise Inside Hadoop
8. Combined Edge Node & Inside Hadoop

1. Open Source R Integrated With Hadoop
Traditional Open Source R "Beside" Architecture:
- CRAN Algorithms
- Packages: RODBC, rhdfs, rhbase, rhive
Characteristics:
- Traditional Open Source
- Memory-Limited
- Data Moves

2. Revolution R Open On Workstations & Servers
Replace Open Source R "Beside" Architecture with Revolution R Open
- CRAN Algorithms
- Packages: RODBC, rhdfs, rhbase, rhive
As with Open Source R: Still Free. Still Memory Based. Data Still Moves.
Improvements:
- Accelerates Math with Intel MKL
- Improves R-based packages
Limitations:
- No Effect for Non-R Code

Accelerate R Math with the Intel Math Kernel Library (MKL)
Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html

3. Write Parallel Algorithms: PC, Server or Clusters
Write R Code to Explicitly Parallelize. Deploy Across Several Systems.
- foreach & iterators
- doParallel (PC, server)
- doMPI (cluster)
- RRE rxExec
Example Uses: Bootstrapping, Simulation, HPC
- Can Include CRAN Algorithms Carefully
- Packages: RODBC, rhdfs, rhbase, rhive
As with Previous: Still Free. Still Memory Based. Data Still Moves. Intel MKL with RRO.
Improvements:
- Parallelized Execution
Limitations:
- Parallelization Difficulty
- Data Movement
- Platform Specific

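The explicit parallelization pattern on this slide (independent tasks farmed out to workers, results collected afterwards) can be sketched in a few lines. The sketch below is in Python rather than R, and it is only an illustration of the pattern, not the foreach or rxExec API. The names `bootstrap_mean` and `parallel_bootstrap` are hypothetical, the data is a toy list, and a thread pool is used for portability; it shows the task structure rather than a CPU speedup, and a process pool or cluster backend would take its place in practice, much as doParallel or doMPI do for foreach.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(data, seed):
    """One bootstrap replicate: resample with replacement, return the mean."""
    rng = random.Random(seed)  # separate, reproducible random stream per task
    sample = rng.choices(data, k=len(data))
    return statistics.fmean(sample)


def parallel_bootstrap(data, n_replicates, max_workers=4):
    """Run the replicates as independent tasks and collect results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        seeds = range(n_replicates)
        return list(pool.map(lambda seed: bootstrap_mean(data, seed), seeds))


# Toy data (assumption for illustration): the values 1..100, true mean 50.5.
data = [float(x) for x in range(1, 101)]
replicates = parallel_bootstrap(data, n_replicates=200)
estimate = statistics.fmean(replicates)    # bootstrap estimate of the mean
std_error = statistics.stdev(replicates)   # bootstrap standard error
```

Each replicate touches the full data set, which is why this option stays memory based: every worker needs its own copy of, or access to, the data.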
4. rhadoop: Custom Parallel Execution for Hadoop
Execute R Code & CRAN Algorithms Inside Hadoop
- Remote Desktop
- R Code
- rmapreduce
- Hadoop Streaming
- Can Include CRAN Algorithms
- rhbase, rhdfs
Example Uses: Scoring, Transformation, Easily Parallelized Algorithms
As With Previous: Still Free. Optional Intel MKL in RRO.
Improvements:
- Runs R in MapReduce
- No Data Movement
Limitations:
- Manual Parallelization
- Hadoop Specific

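To make "manual parallelization" concrete, here is a minimal sketch of the map, shuffle and reduce steps that a streaming-style job asks the author to write by hand. It is written in Python and runs in a single process; it does not use rhadoop or the Hadoop Streaming runtime. The record format (`category,value`), the function names and the sample records are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Parse one 'category,value' record and emit a (key, value) pair."""
    category, value = line.strip().split(",")
    yield category, float(value)


def reducer(key, values):
    """Combine all values that share a key into their mean."""
    values = list(values)
    return key, sum(values) / len(values)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in one process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle and sort
    return dict(
        reducer(key, (value for _, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    )


# Hypothetical input records, one per line as a streaming job would see them.
records = ["a,1.0", "b,10.0", "a,3.0", "b,20.0", "c,5.0"]
result = run_local(records)
```

Aggregations like this one split cleanly by key, which is why scoring, transformation and similar tasks are listed as easy cases. Algorithms that need several passes or shared state are where the manual effort grows.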
5. Revolution R Enterprise (RRE)
PEMAs inside Hadoop
Traditional "Beside" Architecture with Optimized Algorithms. Available for Windows, Linux.
- Revolution R Enterprise: ScaleR PEMA Algorithms plus All of CRAN (subject to memory limits)
- Packages: RODBC, rhdfs, rhbase, rhive
As With Previous: Includes Intel MKL in RRO.
Advantages:
- Speed: PEMAs Parallelize Across Threads, Cores & Sockets
- Scale: PEMAs Chunk, so no Memory Limits
- All of CRAN Available
- Portability
- Fully Supported
Limitations:
- Data Movement
- Single Machine

Revolution R Enterprise is the only big data, big analytics platform based on open source R
- High Performance, Scalable Analytics
- Portable Across Enterprise Platforms
- Easier to Build & Deploy Analytics

ScaleR Refactor Algorithms for Dramatic Performance and Capacity Improvement
ScaleR High Performance Algorithms for the Most Common Uses
Data Step:
- Data import: Delimited, Fixed, SAS, SPSS, ODBC
- Variable creation & transformation
- Recode variables
- Factor variables
- Missing value handling
- Sort, Merge, Split
- Aggregate by category (means, sums)
Descriptive Statistics:
- Min / Max, Mean, Median (approx.)
- Quantiles (approx.)
- Standard Deviation
- Variance
- Correlation
- Covariance
- Sum of Squares (cross product matrix for set variables)
- Pairwise Cross tabs
- Risk Ratio & Odds Ratio
- Cross-Tabulation of Data (standard tables & long form)
- Marginal Summaries of Cross Tabulations
Statistical Tests:
- Chi Square Test
- Kendall Rank Correlation
- Fisher's Exact Test
- Student's t-test
Sampling:
- Subsample (observations & variables)
- Random Sampling
Predictive Models:
- Sum of Squares (cross product matrix for set variables)
- Multiple Linear Regression
- Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
- Covariance & Correlation Matrices
- Logistic Regression
- Classification & Regression Trees
- Predictions/scoring for models
- Residuals for all models
Variable Selection:
- Stepwise Regression
Simulation:
- Simulation (e.g. Monte Carlo)
- Parallel Random Number Generation
Cluster Analysis:
- K-Means
Classification (the slide tags entries in this group as "New in 7.3"):
- Decision Trees
- Decision Forests
- Gradient Boosted Decision Trees
Combination:
- PEMA-R API
- rxDataStep
- rxExec

What's a PEMA? Parallel External Memory Algorithms
Processing flow:
- Script Calls ScaleR Algorithm (scripts can call CRAN Open Source Algorithms)
- Master Algorithm Process: Start & Manage Processing, Combine Individual Results
- ScaleR PEMA: Load Block At A Time, Analyze Each Block
Data Not Limited to Available Memory:
- Unlimited Data Scale
- Ingests Data One Chunk At A Time
- Adjustable Memory Footprint
Performance:
- Multi-Thread Execution
- Highly-Optimized Algorithms
- Algorithm Math Fully Refactored for Parallelism
Delivered as ScaleR Library in Revolution R Enterprise

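The "load a block, analyze it, combine the results" idea can be illustrated with a small external-memory computation of the mean and sample variance. This is a sketch in Python under stated assumptions, not ScaleR's implementation: the function names are hypothetical, an in-memory list sliced into blocks stands in for reading chunks from disk or HDFS, and the pairwise combine step uses the standard formula for merging counts, means and sums of squared deviations.

```python
def chunk_stats(chunk):
    """Analyze one block: return (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def combine(a, b):
    """Merge two partial results without revisiting the underlying data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def read_chunks(values, chunk_size):
    """Stand-in for block-wise reads; only one block is handled at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def external_memory_variance(chunks):
    """Master loop: analyze each block, fold it into the running result."""
    total = None
    for chunk in chunks:
        part = chunk_stats(chunk)
        total = part if total is None else combine(total, part)
    n, mean, m2 = total
    return mean, m2 / (n - 1)


values = [float(x) for x in range(1, 11)]  # toy data: 1..10
mean, variance = external_memory_variance(read_chunks(values, chunk_size=3))
```

Because `combine` is associative, the per-block results could equally be produced by different threads or nodes and merged in any order, which is the property that lets the same algorithm run on one machine or across a cluster.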
6. Run Revolution R Enterprise on Hadoop Edge Node(s)
Fast Single-Server Alternative for Modest Data Scale
- (opt.) Thin Client or Remote Desktop
- Edge Node: ScaleR + CRAN Algorithms
- Packages: RODBC, rhdfs, rhbase, rhive
- Local File System
As With Previous:
- Single Machine Execution
- PEMA Scale & Speed (Single Machine)
- Use ScaleR + CRAN
- Accelerate R with Intel MKL
Improvements:
- Easily Shared via
- No Data Movement
- Develop on Desktop, Run on Edge Node
Limitations:
- Shorter Trip for Data

7. Fast, Transparent Parallel Computation Inside Hadoop YARN/MapReduce
Fast Parallelized Analytics on Large Data Sets In Hadoop
- Desktop & Server Tools and Applications
- Web Services
- DeployR
- Remote Execution
- jobtracker
- ScaleR Algorithms
As With Previous:
- Speed and Scale of ScaleR PEMA Algorithms
- Use CRAN Where Appropriate
- Accelerate R Math with MKL
- Custom Parallelized Algos
Advantages:
- Parallel Computation
- No Data Movement
- ScaleR PEMA Parallelization
- Can Parallelize CRAN Carefully
- Portable Coding
Limitations:
- Hadoop Workload Profiles

One Client's Experience with RRE on Hadoop

Test Cluster - 9 Nodes
Node type   | Count | Memory      | Cores
Edge Node   | 1     | 128GB       | 24
Admin Nodes | 2     | 128GB each  | 24 each
Task Nodes  | 9     | 64GB each   | 24 each

Task | Processing Time
Importing and Filtering Datasets from HDFS:
14 Million Observations | 82 sec.
227 Million Observations | 310 sec.
Modeling and Estimation:
1.2 M Correlations | 2771 sec.
Simple Linear Regression, 227 M Observations | 61 sec.
Multiple Linear Regression, Three Variables, 227 M Observations | 58 sec.
Multiple Linear Regression, Four Variables, 227 M Observations | 58 sec.
Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits | 2 hr. 3 min.

8. Combined Edge Node & In-Hadoop
Maximized Flexibility, Performance & Workload Handling
- Thin Client Development
- Remote Execution
- ScaleR Algorithms
- Desktop & Server Tools and Applications
- Web Services
- rstudio
- DeployR
As With Previous:
- Speed and Scale of ScaleR PEMA Algorithms
- Use CRAN Where Appropriate
- Accelerate R Math with MKL
- Custom Parallelized Algos
Advantages:
- Flexibility for Blended Workloads
- Little or No Data Movement
- Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes

Occasionally Conflicting Criteria
Infrastructure Criteria: Big Data Platform Vendor Choice, Data Ingest, Data Security, Data Governance
Data Science Criteria: Performance, Self Service, Flexibility, Collaboration, Sharing Capability

Key Questions:
- Where are the bulk of your skills? SAS? R? Java? Python? SQL?
- Where do you build models today?
- Do you have the skills to parallelize algorithms?
- Can models be built on a big shared server?
- How will you run models?
- Do you have the budget to purchase commercial solutions?
- How will your needs change over time?
- What is your future architecture plan?
- How risk averse is your management team regarding new platforms and open source?

Key Questions (cont.)
What Workloads Do You Anticipate?
- How Many Users?
- What Workloads?
Workload Realities:
- Many small tasks do not run well in MapReduce
- Large data movements / duplications are costly
What Use Cases Will You Encounter?
- Traditional statistical exploration, modeling?
- Behavior Prediction?
- Outlier Detection?
- Simulation and HPC?
- Massively wide data?
- Real-Time scoring?
- Internet of Things?

Eight Steps to Fast, Scalable R Analytics with Hadoop
Open Source Options:
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization
4. rhadoop
Commercial Options:
5. RRE on Servers & Workstations
6. RRE on Edge Nodes
7. RRE Inside Hadoop
8. RRE on Edge Node & Inside Hadoop
No Clear Winner:
- Budget & use case determine optimal path
- Compelling options in both open source & commercial source
- RRE ScaleR uniquely provides automatic parallelization
- Current Hadoop platforms are fast for large scale analytics
- Combined in-server & in-Hadoop fits majority of cases

2015 Challenges & Opportunities
Evolving Hadoop Architectures:
- In-Memory Analytics: Spark, YARN Containers, Caching
- Additional Algorithm Parallelization
- Cluster Management
- Cloud and Hybrid Cloud Clusters
- SQL on Hadoop Battle-Royale
Addressing the Resource Reality:
- Integration, Deployment Both Drain on Expensive Resources
- Leverage other skills
- Design efficient collaboration
Analytics for the Rest of Us:
- New Consumption Targets: Mobile
- New Participants in Design: Business Users

Recommended Resources
Revolution Analytics Products:
- http://www.revolutionanalytics.com/products
- http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
Whitepaper: Delivering Value from Big Data with Revolution R Enterprise and Hadoop
- http://www.revolutionanalytics.com/whitepaper/delivering-value-big-datarevolution-r-enterprise-and-hadoop
Revolution Analytics on Social Media:
- http://blog.revolutionanalytics.com/
- @revolutionr on Twitter
- @bill_jacobs on Twitter

Thank you. www.revolutionanalytics.com 1.855.GET.REVO Twitter: @RevolutionR