Using Microsoft R Server to Address Scalability Issues

Similar documents
High Performance Predictive Analytics in R and Hadoop:

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

Revolution R Enterprise: Efficient Predictive Analytics for Big Data

Revolution R Enterprise

In-Database Analytics Deep Dive with Teradata and Revolution R

Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise

R Tools Evaluation. A review by Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

SQL Server Everything built-in. Csom Gergely Microsoft Adat platform szakértő

Find the Hidden Signal in Market Data Noise

Laurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Delivering Value from Big Data with Revolution R Enterprise and Hadoop

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Advanced Big Data Analytics with R and Hadoop

RevoScaleR Speed and Scalability

HIGH PERFORMANCE ANALYTICS FOR TERADATA

Modern Data Architecture for Predictive Analytics

Big Data and Data Science: Behind the Buzz Words

SEIZE THE DATA SEIZE THE DATA. 2015

Big Data Too Big To Ignore

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Architectures for Big Data Analytics A database perspective

Fast Analytics on Big Data with H20

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Data processing goes big

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Big Data With Hadoop

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Understanding the Benefits of IBM SPSS Statistics Server

Classroom Demonstrations of Big Data

Technical Paper. Performance of SAS In-Memory Statistics for Hadoop. A Benchmark Study. Allison Jennifer Ames Xiangxiang Meng Wayne Thompson

Integrating Apache Spark with an Enterprise Data Warehouse

An In-Depth Look at In-Memory Predictive Analytics for Developers

Platfora Big Data Analytics

How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) (

QUEST meeting Big Data Analytics

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

extreme Datamining mit Oracle R Enterprise

Cisco Data Preparation

BIG DATA TRENDS AND TECHNOLOGIES

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

Advanced In-Database Analytics

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

RHadoop and MapR. Accessing Enterprise- Grade Hadoop from R. Version 2.0 (14.March.2014)

BIG DATA What it is and how to use?

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Hadoop-BAM and SeqPig

Azure Machine Learning, SQL Data Mining and R

R / TERR. Ana Costa e SIlva, PhD Senior Data Scientist TIBCO. Copyright TIBCO Software Inc.

Using In-Memory Computing to Simplify Big Data Analytics

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Big Data on Microsoft Platform

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Challenges for Data Driven Systems

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Bringing Big Data Modelling into the Hands of Domain Experts

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Unified Big Data Processing with Apache Spark. Matei

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

Hadoop Ecosystem B Y R A H I M A.

SEIZE THE DATA SEIZE THE DATA. 2015

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Microsoft Azure Machine learning Algorithms

IBM Netezza High Capacity Appliance

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Unlocking the True Value of Hadoop with Open Data Science

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Applied Multivariate Analysis - Big data analytics

VOL. 5, NO. 2, August 2015 ISSN ARPN Journal of Systems and Software AJSS Journal. All rights reserved

Revolution R Enterprise 7 Hadoop Configuration Guide

Spark: Cluster Computing with Working Sets

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

MSCA Introduction to Statistical Concepts

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

whitepaper Predictive Analytics with TIBCO Spotfire and TIBCO Enterprise Runtime for R

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

HP Vertica. Echtzeit-Analyse extremer Datenmengen und Einbindung von Hadoop. Helmut Schmitt Sales Manager DACH

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Trafodion Operational SQL-on-Hadoop

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

6.0, 6.5 and Beyond. The Future of Spotfire. Tobias Lehtipalo Sr. Director of Product Management

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

H2O on Hadoop. September 30,

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Data Mining with Hadoop at TACC

STATISTICA Formula Guide: Logistic Regression. Table of Contents

I/O Considerations in Big Data Analytics

Moving From Hadoop to Spark

Accelerating and Simplifying Apache

IBM SPSS Modeler 15 In-Database Mining Guide

Predictive Analytics with TIBCO Spotfire and TIBCO Enterprise Runtime for R

Transcription:

Using Microsoft R Server to Address Scalability Issues

February 4th, 2016 - Welcome!

R What is it? Open Source lingua franca Global Community Ecosystem Can be Scaled to Big Data, Big Analytics Analytics, Computing, Modeling Millions of users 7000+ Algorithms, Test Data & Evaluations Scalability

R from Microsoft brings

Microsoft R Products Microsoft R Open Free and open source R distribution Enhanced and distributed by Revolution Analytics SQL Server R Services Built in Advanced Analytics and Stand Alone Server Capability Leverages the Benefits of SQL 2016 Enterprise Edition Microsoft R Server Microsoft R Server for Redhat Linux Microsoft R Server for SUSE Linux Microsoft R Server for Teradata DB Microsoft R Server for Hadoop on Redhat

Introducing Microsoft R Open Enhanced Open Source R distribution Based on the latest Open Source R (3.2.2) Built, tested and distributed by Microsoft Enhanced by Intel MKL Library to speed up linear algebra functions Compatible with all R-related software CRAN packages, RStudio, third-party R integrations, Revolutions Open-Source R packages Reproducible R Toolkit Checkpoint, minicran ParallelR parallelise execution via foreach loop Rhadoop rhdfs, rhbase, ravro, rmr2, plyrmr AzureML read/write data to AzureML, publish R code as ML API MRAN website mran.revolutionanalytics.com Enhanced documentation and learning resources Discover 6500 free add-on R packages Open source (GPLv2 license) - 100% free to download, use and share

CRAN R compared to Microsoft R Open More efficient and multi-threaded math computation. Benefits math intensive processing. No benefit to program logic and data transform Matrix calculation up to 27x faster Matrix functions up to 16x faster Programation 0x faster

9

CRAN, MRO, MRS Comparison Microsoft R Open Microsoft R Server Datasize In-memory In-memory In-Memory or Disk Based Speed of Analysis Single threaded Multi-threaded Multi-threaded, parallel processing 1:N servers Support Community Community Community + Commercial Analytic Breadth & Depth 7500+ innovative analytic packages 7500+ innovative analytic packages 7500+ innovative packages + commercial parallel highspeed functions License Open Source Open Source Commercial license. Supported release with indemnity

Introducing Microsoft R Server Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics exploration, analysis, visualization and modeling High-performance open source R plus: Data source connectivity to big-data objects Big-data advanced analytics Multi-platform environment support In-Hadoop and in-teradata predictive modeling Development and production environment support IDE for data scientist developers Secure, Scalable R Deployment R Open DeployR R Server DevelopR

The Microsoft R Server Platform R+CRAN Open source R interpreter R 3.1.2 Freely-available huge range of R algorithms Algorithms callable by RevoR Embeddable in R scripts 100% Compatible with existing R scripts, functions and packages MRO Performance enhanced R interpreter Based on open source R Adds high-performance math library to speed up linear algebra functions R Open Microsoft R Server DevelopR DeployR ScaleR Ready-to-Use high-performance big data big analytics Fully-parallelized analytics Data prep & data distillation Descriptive statistics & statistical tests Range of predictive functions User tools for distributing customized R algorithms across nodes Wide data sets supported thousands of variables ConnectR High-speed & direct connectors Available for: High-performance XDF SAS, SPSS, delimited & fixed format text data files Hadoop HDFS (text & XDF) Teradata Database & Aster EDWs and ADWs ODBC DistributedR Distributed computing framework Delivers cross-platform portability

ScaleR Parallel + Big Data Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed Stream data in to RAM in blocks. Big Data can be any data size. We handle Megabytes to Gigabytes to Terabytes XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing. Interim results are collected and combined analytically to produce the output on the entire data set

Scale R Parallelized Algorithms & Functions Data Preparation Data import Delimited, Fixed, SAS, SPSS, OBDC Variable creation & transformation Recode variables Factor variables Missing value handling Sort, Merge, Split Aggregate by category (means, sums) Descriptive Statistics Min / Max, Mean, Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard tables & long form) Marginal Summaries of Cross Tabulations Statistical Tests Chi Square Test Kendall Rank Correlation Fisher s Exact Test Student s t-test Sampling Subsample (observations & variables) Random Sampling Predictive Models Sum of Squares (cross product matrix for set variables) Multiple Linear Regression Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions. Covariance & Correlation Matrices Logistic Regression Classification & Regression Trees Predictions/scoring for models Residuals for all models Variable Selection Stepwise Regression Simulation Simulation (e.g. Monte Carlo) Parallel Random Number Generation Cluster Analysis K-Means Classification Decision Trees Decision Forests Gradient Boosted Decision Trees Naïve Bayes Combination rxdatastep rxexec PEMA-R API Custom Algorithms

ScaleR - Performance comparison Microsoft R Server has no data size limits in relation to size of available RAM. When open source R operates on data sets that exceed RAM it will fail. In contrast Microsoft R Server scales linearly well beyond RAM limits and parallel algorithms are much faster. US flight data for 20 years Linear Regression on Arrival Delay Run on 4 core laptop, 16GB RAM and 500GB SSD

minutes Example of In-Database Acceleration 5+ hours to 40 seconds: R on a server pulling data via SQL R on a server Invoking RRE ScaleR Inside the EDW rows

Distributed R - Write Once. Deploy Anywhere. Hadoop Hortonworks Cloudera MapR + HD Insights + Hadoop Spark Microsoft R Server EDW Teradata + SQL Server v16 DevelopR Workstations & Servers Linux Windows ConnectR ScaleR In the Cloud Azure Marketplace + Azure ML DistributedR Code Portability Across Platforms Coming Soon

DistributedR - Remote Execution Pack and Ship Requests to Remote Environments Predictive Algorithm Work Results Distribute Work, Compile Results Algorithm Master Analyze Blocks In Parallel Load Block At A Time Big Data Microsoft R Server functions A compute context defines remote connection Microsoft R functions prefixed with rx Current compute context determines processing location The Results: Even Faster Computation Larger Data Set Capacity Fewer Security Concerns No Data Movement, No Copies

DistributedR - Revolution Code Portability ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce Compute context R script sets where the model will run ### SETUP LOCAL ENVIRONMENT VARIABLES ### mylocalcc <- localpar ### LOCAL COMPUTE CONTEXT ### rxsetcomputecontext(mylocalcc) ### CREATE LINUX, DIRECTORY AND FILE OBJECTS ### linuxfs <- RxNativeFileSystem() ) AirlineDataSet <- RxXdfData( AirlineDemoSmall/AirlineDemoSmall.xdf, filesystem = linuxfs) Functional model R script does not need to change to run in Hadoop Local Parallel processing Linux or Windows ### SETUP HADOOP ENVIRONMENT VARIABLES ### myhadoopccc <- RxHadoopMR() ### HADOOP COMPUTE CONTEXT ### rxsetcomputecontext(myhadoopcc) ### CREATE HDFS, DIRECTORY AND FILE OBJECTS ### hdfsfs <- RxHdfsFileSystem() AirlineDataSet <- ### ANALYTICAL PROCESSING ### RxXdfData( AirlineDemoSmall/AirlineDemoSmall.xdf ) ### Statistical Summary of the data, filesystem = hdfsfs) rxsummary(~arrdelay+dayofweek, data= AirlineDataSet, reportprogress=1) ### CrossTab the data rxcrosstabs(arrdelay ~ DayOfWeek, data= AirlineDataSet, means=t) In Hadoop ### Linear Model and plot hdfsxdfarrlatelinmod <- rxlinmod(arrdelay ~ DayOfWeek + 0, data = AirlineDataSet) plot(hdfsxdfarrlatelinmod$coefficients)

DistributedR - In-Hadoop Uses Hadoop nodes for R computations Eliminate data movement latency on very large data Remove data duplication Faster model development No MapReduce R coding Develop better models using all the data = Microsoft R Server

DistributedR - Hadoop Processing Methods Method 1 1 Edge node 1:n data Compute nodes Context Compute Hadoop Context Compute Context Hadoop Local Parallel 1 Edge node Compute Context Local Parallel 1:n nodes Compute Context Compute Hadoop Context Compute Hadoop Context Hadoop Linux FS Read / write Processing Data Csv, Xdf Csv, Xdf Csv, Xdf Linux (Local) File-System 1:n disks Copy to Local File HDFS 1:(n x number of nodes) disks Linux (Local) File-System 1:n disks HDFS 1:(n x number of nodes) disks Method 1: Local (Linux) parallel processing using all cores on one node, copying data from HDFS to store in local Linux file-system. ( Beside or Edge ) Method 2: Local (Linux) parallel processing using all cores on one node, streaming data from / to HDFS

DistributedR - Hadoop Processing Methods Method 3 1 Edge node Compute Context Local Parallel Csv, Xdf Linux (Local) File-System 1:n disks R script sent to data nodes 1:n nodes Compute Context Compute Hadoop Context Compute Hadoop Context Hadoop HDFS Read / write Csv, Xdf HDFS 1:(n x number of nodes) disks ( inside ) Method 3: Hadoop (Map-Reduce) parallel processing using all cores on n nodes, using HDFS data on each node Processing Data R model script sent to Master Node: 1. Starts a master process 2. Distribute work 3. Master tasks for each node 4. Master initiates distributed work 1.Hadoop schedules mapper for each split 2.Algorithm computes intermediate result 3.Reducer combines intermediate results 5. Master process evaluates completion 6. Iterates as required by the algorithm 7. Returns consolidated answer to script

DistributedR - What processing mode to use? Analytic data set size and processing complexity (e.g. simple summary statistics vs iterative algorithm) guide the use of Method 1 and 2 (Edge Node / Server Linux local processing) vs Method 3 (in-hadoop processing) Processing Complexity Data Size Small Data < 10GB Low Medium High Legend Edge Node Linux processing In-Hadoop processing Local Linux file-system Hadoop file-system Medium Data < 50GB Bigger Data > 50GB