High Performance Predictive Analytics in R and Hadoop:

High Performance Predictive Analytics in R and Hadoop: Achieving Big Data Big Analytics
Presented by: Mario E. Inchiosa, Ph.D., US Chief Scientist
August 27, 2013

Polling Questions 1 & 2

Revolution Confidential
Agenda
- Riding the Hadoop Wave
- Big Data Big Analytics
- R + Hadoop from Revolution Analytics
- Revolution R Enterprise ScaleR
- Getting Started
- Q&A

Riding the Hadoop Wave

Solve old problems in new ways
Solve new problems

If you want something you've never had, you must be willing to do something you've never done.

Major entertainment company integrates analytics across brands

Fraud detection interval reduced from 2 weeks to 7 hours

Predict mortgage default in time to avoid it

Recommend optimal growing plan

Big Data Big Analytics is different

Big Data is big, complex and messy

Big Analytics are compute intensive

Big Data Big Analytics rewards you

Innovate with R
Download the white paper "R is Hot": bit.ly/r-is-hot
- Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
- Most powerful statistical programming language: flexible, extensible and comprehensive for productivity
- Create beautiful and unique data visualizations: as seen in the New York Times, Twitter and Flowing Data
- Thriving open-source community: leading edge of analytics research
- Fills the talent gap: new graduates prefer R

R is open source and drives analytic innovation, but has some limitations for enterprises
Area | Open source R | Enterprise-ready counterpart
Big Data | In-memory bound | Disk-based scalability
Speed of Analysis | Single threaded | Parallel threading
Enterprise Readiness | Community support | Commercial support
Analytic Breadth & Depth | 5,000+ innovative analytic packages | Leverage open source packages plus Big Data-ready packages

Revolution R Enterprise: High Performance, Multi-Platform Analytics Platform
Revolution Analytics value-add components, providing power and scale to open source R:
- DeployR: Web Services Software Development Kit
- DevelopR: Integrated Development Environment
- ConnectR: High Speed & Direct Connectors for Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC
- ScaleR: High Performance, Scalable, Portable, Parallelized, Full-Featured Big Data Analytics
- DistributeR: Parallel & Distributed Computing Framework for LSF, HPC Server, Azure Burst, Hadoop
- RevoR: Performance Enhanced Open Source R + CRAN packages (open source R plus Revolution Analytics performance enhancements)
Platforms: IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM BigInsights, Intel Hadoop, SMP servers, Teradata

Big Data Speed @ Scale with Revolution R Enterprise
- In-Hadoop Execution
- In-Database Execution
- Parallelized User Code
- Parallelized Algorithms
- Multi-Core Execution
- Multi-Threaded Execution
- Memory Management
- Fast Math Libraries

Our Objectives with Respect to Hadoop
- Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop
- Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their workstations
- Scalable and High Performance

Simplicity Goal: Hadoop As An R Engine
- Run Revolution R Enterprise Code In Hadoop Without Change
- Provide ScaleR Pre-Parallelized Algorithms
- No Need To "Think In MapReduce"
- Eliminate Movement to Slash Latencies
- Expanded Deployment Options

Revolution R Enterprise ScaleR
An R package that adds capabilities to R:
- Data Import/Clean/Explore/Transform
- Analytics: Descriptive and Predictive
- Parallel and distributed computing
- Visualization
Scales from small local data to huge distributed data
Scales from workstation to server to cluster to cloud
Portable: the same code works on small and big data, and on workstation, server, cluster, Hadoop

High Performance Big Data Analytics with Revolution R Enterprise ScaleR
- R Data Step
- Descriptive Statistics
- Statistical Tests
- Sampling
- Predictive Models
- Data Visualization
- Machine Learning
- Simulation

ScaleR: High Performance Scalable Parallel External Memory Algorithms
Data Prep, Distillation & Descriptive Analytics
R Data Step:
- Data import: Delimited, Fixed, SAS, SPSS, ODBC
- Variable creation & transformation
- Recode variables
- Factor variables
- Missing value handling
- Sort
- Merge
- Split
- Aggregate by category (means, sums)
- Use any of the functionality of the R language to transform and clean data row by row!
Descriptive Statistics:
- Min / Max
- Mean
- Median (approx.)
- Quantiles (approx.)
- Standard Deviation
- Variance
- Correlation
- Covariance
- Sum of Squares (cross product matrix for set variables)
- Risk Ratio & Odds Ratio
- Cross-Tabulation of Data (standard tables & long form)
- Marginal Summaries of Cross Tabulations
Statistical Tests:
- Chi Square Test
- t-test
- F-Test
- Plus 1,000s of other tests available in R!
Sampling:
- Subsample (observations & variables)
- Random Sampling
- High quality, fast, parallel random number generators

Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Predictive Models
Statistical Modeling:
- Covariance, Correlation, Sum of Squares (cross product matrix for set variables) matrices
- Multiple Linear Regression
- Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
- Logistic Regression
- Classification & Regression Trees
- Decision Forests
- Predictions/scoring for models
- Residuals for all models
Data Visualization:
- Histogram
- Line Plot
- Lorenz Curve
- ROC Curves (actual data and predicted values)
Variable Selection:
- Stepwise Regression
- PCA
Simulation:
- Parallel random number generators for Monte Carlo
- Use the rich functionality of R for simulations
Machine Learning
Cluster Analysis:
- K-Means
Classification:
- Decision Trees
- Decision Forests

ScaleR Scalability and Performance
- Handles an arbitrarily large number of rows in a fixed amount of memory
- Scales linearly with the number of rows
- Scales linearly with the number of nodes
- Scales well with the number of cores per node
- Scales well with the number of parameters
- Extremely high performance

GLM comparison using in-memory data: glm() and ScaleR's rxGlm()

Allstate compares SAS, Hadoop and R for Big-Data Insurance Models
Approach | Platform | Time to fit
SAS | 16-core Sun Server | 5 hours
rmr/mapreduce | 10-node 80-core Hadoop Cluster | > 10 hours
R | 250 GB Server | Impossible (> 3 days)
Revolution R Enterprise | 5-node 20-core LSF cluster | 5.7 minutes
Generalized linear model, 150 million observations, 70 degrees of freedom
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html

SAS HPA Benchmarking comparison* (Logistic Regression)
Metric | SAS HPA | Revolution R Enterprise relative to SAS HPA | Revolution R Enterprise
Rows of data | 1 billion | | 1 billion
Parameters | "just a few" | Double | 7
Time | 80 seconds | 45% | 44 seconds
Data location | In memory | | On disk
Nodes | 32 | 1/6th | 5
Cores | 384 | 5% | 20
RAM | 1,536 GB | 5% | 80 GB
Revolution R Enterprise is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
Revolution R Enterprise Delivers Performance at 2% of the Cost
*As published by SAS in HPC Wire, April 21, 2011
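The relative figures on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch in base R that uses only the numbers quoted above. The variable names are illustrative, and reading the slide's 45% as the reduction in run time is an interpretation, not something the slide states.

```r
# Figures quoted on the slide: SAS HPA run vs. Revolution R Enterprise run
sas <- c(seconds = 80, nodes = 32, cores = 384, ram_gb = 1536)
rre <- c(seconds = 44, nodes = 5,  cores = 20,  ram_gb = 80)

# Revolution R Enterprise value as a fraction of the SAS HPA value
ratio <- rre / sas
print(round(ratio, 3))

# Fraction of the SAS HPA run time that was saved
time_saved <- 1 - ratio[["seconds"]]
print(time_saved)
```

On these numbers the Revolution R Enterprise run uses about 5% of the cores and RAM and about 16% (roughly a sixth) of the nodes, and finishes in 45% less time.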

Specific speed-related factors
- Efficient computational algorithms
- Efficient memory management: minimize data copying and data conversion
- Heavy use of C++ templates; optimal code
- Efficient data file format; fast access by row and column
- Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
- Handle categorical variables efficiently

ScaleR Parallel External Memory Algorithms (PEMAs)
- The ScaleR analytics algorithms are all built on a platform (DistributeR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
- These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time, in parallel across cores and nodes
- Steps: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
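The four steps can be illustrated with a toy example. The following is a minimal sketch in base R, not the DistributeR API. The `pema_*` function names are hypothetical, and mean and variance are used only because their per-chunk results are easy to combine.

```r
# Hypothetical helper names; this is an illustration, not the DistributeR API.

# 1) Initialize: an empty set of running totals
pema_initialize <- function() list(n = 0, sum = 0, sumsq = 0)

# 2) Process Chunk: summarize one chunk independently of all others
pema_process_chunk <- function(chunk) {
  list(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}

# 3) Aggregate: combine two intermediate results
pema_aggregate <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4) Finalize: turn the combined totals into the statistics of interest
pema_finalize <- function(state) {
  m <- state$sum / state$n
  v <- (state$sumsq - state$n * m^2) / (state$n - 1)
  list(mean = m, var = v)
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))  # 10 chunks of 1,000 values

partials <- lapply(chunks, pema_process_chunk)    # independent, so parallelizable
state <- Reduce(pema_aggregate, partials, pema_initialize())
result <- pema_finalize(state)
print(result)
```

Because each chunk is summarized independently, the `lapply` call could be replaced by any parallel or distributed map without changing the other three steps. The sum-of-squares formula is kept for brevity; a production implementation would likely use a numerically more stable update.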

Scalability and portability of Revolution Analytics' implementation of PEMAs
- These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability
- They are independent of the compute context (number of cores, computers, distributed computing platform), giving portability across these dimensions
- They are independent of where the data is coming from, giving portability with respect to data sources
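The fixed-memory property can be sketched as follows, assuming a CSV file as the data source and ordinary least squares as the analysis. This is base R, not ScaleR. The temporary file, the chunk size and the column names are made up for illustration. Only one chunk of rows and two small accumulator matrices are held in memory at any time.

```r
# Create a small CSV that stands in for a file too large for RAM
set.seed(42)
n <- 5000
df <- data.frame(x1 = rnorm(n), x2 = runif(n))
df$y <- 1 + 2 * df$x1 - 3 * df$x2 + rnorm(n, sd = 0.1)
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

chunk_rows <- 500                      # rows held in memory at once
con <- file(path, open = "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

xtx <- matrix(0, 3, 3)                 # running X'X
xty <- matrix(0, 3, 1)                 # running X'y
repeat {
  lines <- readLines(con, n = chunk_rows)
  if (length(lines) == 0) break        # end of file
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  X <- cbind(1, chunk$x1, chunk$x2)    # intercept plus two predictors
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}
close(con)

# Solve the normal equations once all chunks have been seen
beta <- drop(solve(xtx, xty))
print(beta)
```

The memory footprint depends on `chunk_rows` and the number of predictors, not on the number of rows in the file. Swapping the file connection for another source changes only the code that produces `chunk`.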

Simplified ScaleR Internal Architecture
- Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
- Inter-process Communication: MPI, RPC, Sockets, Files
- Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF

ScaleR on Hadoop
- Each pass through the data is one MapReduce job
- Prediction (Scoring), Transformation, Simulation: Map tasks store results in HDFS or return to client
- Statistics, Model Building, Visualization: Map tasks produce intermediate result objects that are aggregated by a Reduce task
- Master process decides if another pass through the data is required
- Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
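The pass structure for model building can be sketched in base R, with `lapply` standing in for the map tasks and `Reduce` for the reduce task. This is an illustration under stated assumptions, not the ScaleR implementation: the function names are hypothetical, the data is simulated, and logistic regression fitted by Newton's method is chosen only because it needs several passes.

```r
set.seed(7)
n <- 4000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
X <- cbind(1, x)

# Each element stands in for one block of data handled by one map task
blocks <- split(seq_len(n), rep(1:8, length.out = n))

# Map: per-block gradient and Hessian of the log-likelihood at the current beta
map_task <- function(idx, beta) {
  Xb <- X[idx, , drop = FALSE]
  p <- plogis(drop(Xb %*% beta))
  list(grad = crossprod(Xb, y[idx] - p),
       hess = crossprod(Xb, Xb * (p * (1 - p))))
}

# Reduce: add up the intermediate result objects
reduce_task <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}

beta <- c(0, 0)
passes <- 0
repeat {
  partials <- lapply(blocks, map_task, beta = beta)  # one pass over the data
  total <- Reduce(reduce_task, partials)
  step <- drop(solve(total$hess, total$grad))        # Newton update
  beta <- beta + step
  passes <- passes + 1
  # The master decides whether another pass is required
  if (max(abs(step)) < 1e-8 || passes >= 25) break
}
print(beta)
print(passes)
```

Each trip around the loop corresponds to one job. The convergence check plays the role of the master process, which is why caching the data between passes pays off for iterative algorithms.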

Sample code for logit on workstation
    # Specify local data source
    airdata <- mylocaldatasource
    # Specify model formula and parameters
    rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
              UniqueCarrier + F(CRSDepTime),
            data = airdata)

Sample code for logit on Hadoop
    # Change the compute context
    rxSetComputeContext(myhadoopcluster)
    # Change the data source if necessary
    airdata <- myhadoopdatasource
    # Otherwise, the code is the same
    rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
              UniqueCarrier + F(CRSDepTime),
            data = airdata)

Demo: rxLinMod in Hadoop - Launching

Demo: rxLinMod in Hadoop - In Progress

Demo: rxLinMod in Hadoop - Completed

Revolution R Enterprise 7 on Hadoop and Analytics Clusters
Right Tool For The Job: RRE 7 Inside and Beside Hadoop
[Diagram: Revolution R Enterprise 7 on Hadoop. Inside Hadoop, Revolution R Enterprise (ScaleR Algorithms, DistributedR Framework, ConnectR: HDFS, ODBC & High-Speed Connectors) runs on Hadoop MapReduce next to other MapReduce jobs. Beside Hadoop, an Analytics Compute Server or Cluster (Linux, Windows, LSF or Azure) runs Revolution R Enterprise (DeployR, ScaleR Algorithms, DistributedR Framework, ConnectR: HDFS, HBase, ODBC & High-Speed Connectors). Other labels: Data Analytics Applications, BI and Browser, EDW & Other Sources.]
Connect a Compute Server or Cluster to Hadoop
When To Use:
- Production Hadoop Cluster
- Need Parallelized Algorithms
- Heavy Random Workloads
- Extensive Sandboxing
- Big Data Scoring
- Data Security Constraints
- Legacy Data Sources
Advantages:
- Independent Scalability
- Flexibility
- Low Latency

Consulting Services and Training
Services: Remote & On site; Projects & Staff Aug; Quick Start Programs; Entire project lifecycle
- Big Data Analytics Strategy
- Design & Architecture
- Use Case Definition
- Model Development & Deployment
- Support & Maintenance
Training: Comprehensive Topics; Self Paced & Classroom; Customizable
- R with Hadoop
- R for SAS Users
- Data Visualization
- Parallel Computing with RRE
- Big Data Analytics with RRE

Polling Question 3

Questions Contact Revolution Analytics at info@revolutionanalytics.com