Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise Revolution Webinar April 17, 2014 Mario Inchiosa, US Chief Scientist mario.inchiosa@revolutionanalytics.com All Rights Reserved, Revolution Analytics 2014
OUR COMPANY OUR SOFTWARE SOME KUDOS The leading provider of advanced analytics software and services based on open source R, since 2007 The only Big Data, Big Analytics software platform based on the data science language R Visionary Gartner Magic Quadrant for Advanced Analytics Platforms, 2014 2
Typical Challenges our Customers Face Big Data Many new data sources Data variety & velocity Data movement, memory limits Complex Computation Mathematically sophisticated Parallelization Experimentation Ensemble models Many small models Simulation Enterprise Readiness Heterogeneous landscape Write once, deploy anywhere Skill shortage Production support Production Efficiency Shorter model shelf life Volume of Models Long end-toend cycle time Pace of decision accelerated 3
Polling Question: What is your current analytics software platform? Please select one R/RRE SAS SPSS Tibco/Spotfire KXEN Other
OPEN SOURCE R
What is R? Most widely used data analysis software Used by 2M+ data scientists, statisticians and analysts Most powerful statistical programming language Flexible, extensible and comprehensive for productivity Create beautiful and unique data visualizations As seen in New York Times, Twitter and Flowing Data Thriving open-source community Leading edge of analytics research Fills the talent gap New graduates prefer R R is Hot bit.ly/r-is-hot WHITE PAPER
Exploding growth and demand for R R Usage Growth Rexer Data Miner Survey, 2007-2013 70% of data miners report using R R is the first choice of more data miners than any other software Source: www.rexeranalytics.com R is the highest paid IT skill Dice.com, Jan 2014 R most-used data science language after SQL O Reilly, Jan 2014 R is used by 70% of data miners Rexer, Sep 2013 R is #15 of all programming languages RedMonk, Jan 2014 R growing faster than any other data science language KDnuggets, Aug 2013 More than 2 million users worldwide
REVOLUTION R ENTERPRISE THE BIG DATA BIG ANALYTICS PLATFORM
Revolution R Enterprise is. the only big data big analytics platform based on open source R High Performance, Scalable Analytics Portable Across Enterprise Platforms Easier to Build & Deploy Analytics 9
R is open source and drives analytic innovation but.has some limitations for Enterprises Big Data In-memory bound Hybrid memory & disk scalability Operates on bigger volumes & factors Speed of Analysis Enterprise Readiness Single threaded Parallel threading Shrinks analysis time Community support Commercial support Delivers full service production support Analytic Breadth & Depth 5000+ innovative analytic packages Leverage open source packages plus Big Data ready packages Supercharges R 10
is the Big Data Big Analytics Platform All of Open Source R plus: Big Data scalability High-performance analytics Development and deployment tools Data source connectivity Application integration framework Multi-platform architecture Support, Training and Services 11
The Platform Step by Step: R Capabilities R+CRAN Open source R interpreter UPDATED R 3.0.2 Freely-available R algorithms Algorithms callable by RevoR Embeddable in R scripts 100% Compatible with existing R scripts, functions and packages RevoR Performance enhanced R interpreter Based on open source R Adds high-performance math Available On: IBM Platform LSF TM Clusters Microsoft HPC Clusters Windows & Linux Servers Windows & Linux Workstations NEW Cloudera Hadoop NEW Hortonworks Hadoop NEW Teradata Database Intel Hadoop IBM BigInsights TM IBM PureData TM for Analytics, powered by Netezza technology 12
Revolution R Enterprise RevoR Performance Enhanced R Open Source R Customers report Revolution 3-50x R performance improvements Enterprise compared to Open Source R without changing any code Computation (4-core laptop) Open Source R Revolution R Speedup Linear Algebra 1 Matrix Multiply 176 sec 9.3 sec 18x Cholesky Factorization 25.5 sec 1.3 sec 19x Linear Discriminant Analysis 189 sec 74 sec 3x General R Benchmarks 2 R Benchmarks (Matrix Functions) 22 sec 3.5 sec 5x R Benchmarks (Program Control) 5.6 sec 5.4 sec Not appreciable 1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php 2. http://r.research.att.com/benchmarks/ 13
The Platform Step by Step: Parallelization & Data Sourcing ScaleR Ready-to-Use high-performance big data big analytics Fully-parallelized analytics Data prep & data distillation Descriptive statistics & statistical tests Correlation & covariance matrices Predictive Models linear, logistic, GLM Machine learning Monte Carlo simulation NEW Tools for distributing customized algorithms across nodes DistributedR Distributed computing framework Delivers portability across platforms ConnectR High-speed data import/export Available for: High-performance XDF SAS, SPSS, delimited & fixed format text data files Hadoop HDFS (text & XDF) Teradata Database TPT ODBC (incl. Vertica, Oracle, Pivotal, Aster, SybaseIQ, DB2, MySQL) DistributedR available on: Windows Servers Red Hat and NEW SuSE Linux Servers IBM Platform LSF Linux Clusters Microsoft HPC Clusters NEW Cloudera Hadoop NEW Hortonworks Hadoop NEW Teradata Database 14
The Platform Step by Step: Tools & Deployment DevelopR Integrated development environment for R Visual step-into debugger Available on: Windows DevelopR DeployR DeployR Web services software development kit for integration analytics via Java, JavaScript or.net APIs Integrates R Into application infrastructures Capabilities: Invokes R Scripts from web services calls RESTful interface for easy integration Works with web & mobile apps, leading BI & Visualization tools and business rules engines 15
010101010010101001010010010010100101010101 011010 100101 101001 111000 Scalable and Parallelized across Cores and Nodes DATA PARTITION COMPUTE NODE SHARED MEMORY 01010101001001010011100100100101001010101010101010100100 0101 01010 BIG DATA CORE0 THREAD 0 CORE01 THREAD 1 CORE02 THREAD 2 CORE03 THREAD N 0010 MULTICORE PROCESSOR 4, 8, 16+ CORES 0010 1110 MASTER NODE 1100 COMPUTE NODE COMPUTE NODE Combine Intermediate Results Evaluate COMPUTE NODE 16
ScaleR Scalability and Performance Handles an arbitrarily large number of rows in a fixed amount of memory Scales linearly with the number of rows Scales linearly with the number of nodes Scales well with the number of cores per node Scales well with the number of parameters Extremely high performance 17
Eliminates Performance and Capacity Limits of Open Source R and Legacy SAS Unique PEMAs: Parallel, external-memory algorithms High-performance, scalable replacements for R/SAS analytic functions Parallel/distributed processing eliminates CPU bottleneck Data streaming eliminates memory size limitations Works with in-memory and disk-based architectures 18
Revolution R Enterprise ScaleR Outperforms SAS HPA at a Fraction of the Cost Logistic Regression: Rows of data 1 billion 1 billion Parameters just a few 7 Time 80 seconds 44 seconds Data location In memory On disk Nodes 32 5 Cores 384 20 RAM 1,536 GB 80 GB Revolution R is faster on the same amount of data, despite using approximately a 20 th as many cores, a 20 th as much RAM, a 6 th as many nodes, and not pre-loading data into RAM. *As published by SAS in HPCwire, April 21, 2011 See Revolution white paper for additional benchmarks. 20
Specific speed-related factors Efficient computational algorithms Efficient memory management minimize data copying and data conversion Heavy use of C++ templates; optimal code Efficient data file format; fast access by row and column Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities) Handle categorical variables efficiently Revolution R Enterprise 21
Scalability and portability of Revolution Analytics Parallel External Memory Algorithms (PEMAs) Anatomy of a PEMA: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize Process a chunk of data at a time, giving linear scalability Process an unlimited number of rows of data in a fixed amount of RAM Independent of the compute context (number of cores, computers, distributed computing platform), giving portability across these dimensions Independent of where the data is coming from, giving portability with respect to data sources Revolution R Enterprise 22
Simplified ScaleR Internal Architecture Analytics Engine PEMA s are implemented here (Scalable, Parallelized, Threaded, Distributable) Inter-process Communication MPI, RPC, Sockets, Files, UDFs Data Sources HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF Revolution R Enterprise 23
Write Once. Deploy Anywhere. Hadoop Cloudera Hortonworks EDW Teradata IBM PureData for Analytics ConnectR DeployR Clustered Systems Workstations & Servers IBM Platform LSF Microsoft HPC Windows Red Hat and SUSE Linux ScaleR In the Cloud Amazon AWS DistributedR DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE 24
Decision Trees Easy-to-interpret models Widely used in a variety of disciplines. For example, Predicting which patient characteristics are associated with high risk of, for example, heart attack. Deciding whether or not to offer a loan to an individual based on individual characteristics. Predicting the rate of return of various investment strategies Retail target marketing Can handle multi-level factor response easily Useful in identifying important interactions Revolution R Enterprise 25
Decision Tree Types Classification tree: predict what class or group an observation belongs to (dependent variable is a factor) Regression tree: predict the value of a continuous dependent variable Revolution R Enterprise 26
Polling Question: What is your go-to tree algorithm for predictions in your work? Please select one answer Single Trees Random Forests It Depends Neither 27
Classification Example: Marketing Response Data set containing the following information: Response: Was response to a phone call, email, or mailing? Age Income Marital status Attended college? Revolution R Enterprise 28
Estimating the model treeout <- rxdtree(response ~ age + income + college + marital, data = rdata) Revolution R Enterprise 29
Simple Example: Text Output Information on the split, the number of observations in the node, the loss, the predicted value, and the probabilities 1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000) 2) college=no College 5074 2378 Phone (0.53133622 0.38943634 0.07922743) 4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639) 8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) * 9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) * 5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901) 10) marital=single 835 371 Phone (0.55568862 0.40958084 0.03473054) 20) income>=29.5 472 14 Phone(0.97033898 0.00000000 0.02966102) * 21) income< 29.5 363 21 Email(0.01652893 0.94214876 0.04132231) * 11) marital=married 1721 87 Email(0.02556653 0.9494480.02498547) * 3) college=college 4926 971 Email (0.12789281 0.80288266 0.06922452) Revolution R Enterprise 30
Interactive HTML Graphics Revolution R Enterprise 31
The Big Data Decision Tree Algorithm Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data. This sorting step becomes prohibitive when dealing with large data. rxdtree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data rxdtree partitions the data horizontally, processing in parallel different subsets of the observations The accuracy of the parallel tree approximately equals that of the serial tree (Ben-Haim & Tom-Tov, 2010) Revolution R Enterprise 32
Useful rxdtree Arguments for Big Data maxdepth: maximum tree depth minbucket: minimum number of observations in a terminal node minsplit: minimum number of observations needed to split cp: minimum fit improvement needed to accept a split maxnumbins: maximum number of bins used to bin numeric data Revolution R Enterprise 33
Related Decision Tree Functions prune.rxdtree model simplification rxdtreebestcp optimizes pruning rxaddinheritance makes rxdtree compatible with rpart for printing and plotting createtreeview interactive HTML graphics rxpredict.rxdtree scores new data using rxdtree model rxdforest Ensembles of Decision Trees big data alternative to randomforest package rxvarimpplot plots variable importance as measured by rxdforest rxpredict.rxdforest scores new data using rxdforest model 34
Sample code for Decision Trees on workstation # Specify local data source airdata <- mylocaldatasource # Specify model formula and parameters rxdtree( ArrDelay ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + CRSDepTime, data=airdata ) 35
Sample code for Decision Trees on Hadoop # Change the compute context rxsetcomputecontext(myhadoopcluster) # Change the data source if necessary airdata <- myhadoopdatasource # Otherwise, the code is the same rxdtree( ArrDelay ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + CRSDepTime, data=airdata ) 36
Write Once Deploy Anywhere Set the desired compute context for data analytics execution.. Local System rxsetcomputecontext("local") # DEFAULT rxsetcomputecontext(rxlsfcluster(<data, server environment arguments>)) rxsetcomputecontext(rxhpcserver(<data, server environment arguments>)) rxsetcomputecontext(rxhadoopmr(<data, server environment arguments>)) rxsetcomputecontext(rxinteradata(<data, server environment arguments>)) Same code to be run anywhere.. # Summarize and calculate descriptive statistics adssummary <- rxsummary(~arrdelay+crsdeptime+dayofweek, data = airds) # Fit Linear Model arrdelaylm1 <- rxlinmod(arrdelay ~ DayOfWeek, data = airds)
Polling Question: What platforms are you most interested in running tree models on your data? Please select all that apply Server Grid Hadoop Teradata Other 38
Revolution R Enterprise ScaleR: High Performance Big Data Analytics Data Prep, Distillation & Descriptive Analytics R Data Step Descriptive Statistics Statistical Tests Sampling Data import Delimited, Fixed, SAS, SPSS, ODBC Variable creation & transformation using any R functions and packages Recode variables Factor variables Missing value handling Sort Merge Split Aggregate by category (means, sums) Min / Max Mean Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard tables & long form) Marginal Summaries of Cross Tabulations Chi Square Test Kendall Rank Correlation Fisher s Exact Test Student s t-test Subsample (observations & variables) Random Sampling
Revolution R Enterprise ScaleR (continued) Statistical Modeling Machine Learning Predictive Models Data Visualization Variable Selection Cluster Analysis Covariance/Correlation/Sum of Squares/Cross-product Matrix Multiple Linear Regression Logistic Regression Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. - User defined distributions & link functions. Classification & Regression Trees and Forests Residuals for all models Histogram ROC Curves (actual data and predicted values) Lorenz Curve Line and Scatter Plots NEW Tree Visualization Stepwise Regression Linear NEW logistic NEW GLM Simulation and HPC Monte Carlo Run open source R functions and packages across cores and nodes K-Means Classification Decision Trees NEW Decision Forests Deployment Prediction (scoring) NEW PMML Export
Resources For You Big Data Decision Trees with R http://www.revolutionanalytics.com/whitepaper/big-data-decision-trees-r Advanced, Big Data Analytics with R and Hadoop http://www.revolutionanalytics.com/whitepaper/advanced-big-data-analytics-rand-hadoop Revolution R Enterprise: Faster than SAS http://www.revolutionanalytics.com/whitepaper/revolution-r-enterprise-fastersas May 13, 2014 Webinar presenting the results of our RRE vs. SAS benchmarking To Get RRE for yourself, please visit: http://www.revolutionanalytics.com/get-revolution-r-enterprise 41
42
Thank you Revolution Analytics is the leading commercial provider of software and support for the popular open source R statistics language. mario.inchiosa@revolutionanalytics.com www.revolutionanalytics.com, 1.855.GET.REVO, Twitter: @RevolutionR 43