Technical Paper

Performance of SAS In-Memory Statistics for Hadoop: A Benchmark Study

Allison Jennifer Ames, Xiangxiang Meng, Wayne Thompson
Release Information: Content Version 1.0, May 20, 2014

Trademarks and Patents: SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
Contents

Executive Summary
Introduction
Construction of Proxy Data
Benchmark Methods
  Computing Environment
  Benchmark Tasks
Results
Conclusion
References
Executive Summary

A recent benchmark study undertaken by Revolution Analytics included claims such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore & Norton, 2014). However, that study compared Revolution R Enterprise's (RRE) Parallel External Memory Algorithms, a distributed process, to SAS procedures that were not run in distributed mode. To make a fairer comparison, this benchmark study ran the same tasks in a distributed analytic environment. That is, we constructed a data set of identical size to the one used in the Revolution Analytics benchmark and ran the same tasks using SAS In-Memory Statistics for Hadoop (PROC IMSTAT) on a cluster with the same number of nodes as the hardware used in the Revolution Analytics benchmark. Results indicate:

- With 5 million observations and 134 columns, PROC IMSTAT took a total of 12.56 seconds to complete all tasks. In comparison, RRE7 completed in 109.7 seconds. Thus, Revolution Analytics' RRE7 took 8.7 times as long to run the same set of tasks as PROC IMSTAT.
- Individual tasks took from 2.8 to 40 times as long to run in RRE7 as in PROC IMSTAT.
- In all instances, PROC IMSTAT outperformed the RRE7 reported timings for both the 1 million and 5 million observation data sets.
- Scoring a 50 million observation data set completed in 1.34 seconds; the comparable task in RRE7 took 21.5 times as long.
Introduction

The context for this study begins at the Strata Conference on October 25, 2012, where the research and planning division of a large insurance corporation presented various methods that they had used to model 150 million observations of insurance data. A summary of their presentation is available at http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html. In this performance benchmark, Revolution Analytics asserted that its Parallel External Memory Algorithms (PEMA) resulted in vastly better performance for advanced analytics (Dinsmore & Norton, 2014). However, several readers voiced concern regarding the methodology used, and the validity of the claims made, by Revolution Analytics. These readers pointed out that the Revolution Analytics tests were run on clustered computing environments, but the SAS benchmark tests were not. In March 2014, a follow-up benchmark study was undertaken by Revolution Analytics to make a fairer comparison by running the tests on the same hardware; it included hiring a SAS consultant to review the programs and enable them for grid computing. The second Revolution Analytics benchmark's findings included claims such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore & Norton, 2014). However, Dinsmore and Norton (2014) deployed SAS Release 9.4 with Base SAS, SAS/STAT, and SAS Grid Manager as the major components, and used a desktop machine running SAS Management Console and SAS Enterprise Guide as the grid client. Despite enabling the grid, SAS procedures running on a single node were compared to distributed Revolution Analytics algorithms. In the one instance in which a distributed SAS procedure was used (PROC HPREG), the SAS High-Performance Analytics Server was not utilized, so the benefits of the High-Performance procedures could not be fully realized.
While we applaud the attempt to make a fairer comparison between Revolution Analytics and SAS products, and Revolution Analytics' transparency in posting the SAS code used to run the procedures (at https://github.com/RevolutionAnalytics/Benchmark), the benchmark is still not an evaluation using comparable computing environments. The computing environments used in the 2014 Revolution Analytics benchmark remain dramatically different despite the intention to provide a more just comparison. Dinsmore and Norton (2014) concluded that SAS/STAT software was slower than RRE because of the way in which SAS/STAT swaps data between memory and disk when a data set is larger than memory, a process which can be slower than in-memory operations. In contrast, RRE uses Parallel External Memory Algorithms (PEMA) to distribute operations over multiple machines in a clustered architecture. When a data set is larger than the memory of any single machine, rather than swap to disk, RRE distributes the data across all available computing resources. This, Dinsmore and Norton (2014) claim, is the reason behind the vastly different timings. A more fruitful and just comparison can be made by comparing SAS distributed procedures to RRE distributed algorithms. The purpose of this benchmark is to make such a comparison. We generated a data set comparable to the one described in the 2014 Revolution Analytics benchmark and performed a set of tests using SAS LASR Analytic Server and SAS In-Memory Statistics for Hadoop. The remainder of the paper discusses the construction of the proxy data, a description of the SAS LASR Analytic Server and SAS In-Memory Statistics for Hadoop, the benchmark procedures, results, and conclusions.
Construction of Proxy Data

Three data sets were generated to mimic the properties of those used in the Dinsmore and Norton (2014) study in terms of row and column counts. The row counts of these data sets are 1 million, 5 million, and 50 million, respectively; each table contains 134 columns. All data generation was performed using the IMSTAT procedure on the SAS LASR Analytic Server.

Benchmark Methods

Computing Environment

SAS LASR Analytic Server is a dedicated, multipass, in-memory analytic engine designed to address advanced analytics in a scalable manner; it provides secure, multiuser, concurrent access to data of any size. The SAS In-Memory Statistics for Hadoop procedure (PROC IMSTAT) moves all of the data into dedicated memory, the main advantage being the ability to analyze all of the data in the shortest amount of time. The software is optimized for distributed, multithreaded architectures and scalable processing, so requests to run new scenarios or complex analytical computations are handled very quickly. This benchmark demonstrates just how fast some common analytical procedures can be performed. PROC IMSTAT uses in-memory analytics technology to perform analyses that range from data exploration, visualization, and descriptive statistics to model building with advanced statistical and machine learning algorithms and scoring new data.

Revolution Analytics used a clustered computing environment consisting of five four-core machines running CentOS, networked with Gigabit Ethernet connections and a separate NFS server. Revolution R Enterprise Release 7 (RRE7) was installed on each node. To make a valid comparison, all tasks run within PROC IMSTAT on the SAS LASR Analytic Server used five nodes as well (one name node and four data nodes).
Benchmark Tasks

The tasks included in the benchmark are listed in Table 1.

Task                                                                            RRE 7 Capability   SAS PROC IMSTAT
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable           rxsummary          summary
Median and deciles for 1 numeric variable                                       rxquantile         percentile
Frequency distribution for 1 text variable                                      rxcube             frequency
Linear regression, 1 numeric response, 20 numeric predictors, with score code   rxlinmod           glm
Linear regression, 1 numeric response, 10 numeric and 10 categorical predictors rxlinmod           glm
Stepwise linear regression with 100 numeric predictors                          rxlinmod           --
Logistic regression, 1 binary response, 20 numeric predictors                   rxlogit            logistic
Generalized linear model, 20 numeric predictors, gamma distribution and link    rxglm              genmodel
k-means clustering with 20 active variables                                     rxkmeans           cluster
k-means clustering with 100 active variables                                    rxkmeans           cluster

Table 1: Benchmark Tasks

An example script for computing frequencies with PROC IMSTAT is shown below. For a more comprehensive discussion of the SAS LASR Analytic Server and SAS In-Memory Statistics for Hadoop, see the SAS LASR Analytic Server reference guide and the PROC IMSTAT documentation (SAS Institute Inc., 2014).

proc lasr create port=&myport path="/tmp";
   performance nodes=4;
run;

libname lasr sasiola port=&myport tag='work';

data lasr.data1m;
   set &data1m.;
run;

proc imstat;
   table lasr.data1m;
   frequency DemTVReg;
run;
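Each task above was timed as a single wall-clock measurement. The measurement pattern can be sketched in Python; this is a minimal illustration of per-task timing, not the actual benchmark code, and `descriptive_stats` is a hypothetical stand-in for one of the tasks:

```python
import time

def time_task(label, task, *args):
    """Run one benchmark task once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    task(*args)
    return label, time.perf_counter() - start

# Hypothetical stand-in for the first benchmark task
# (descriptive statistics on one numeric variable).
def descriptive_stats(values):
    n = len(values)
    return n, min(values), max(values), sum(values) / n

label, elapsed = time_task("descriptive statistics", descriptive_stats, [1.0, 2.0, 3.0])
print(f"{label}: {elapsed:.4f}s")
```

In both benchmarks the reported number for a task is a single elapsed time in seconds, which is what the ratios in the Results section are computed from.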
A DISTRIBUTIONINFO statement provides information about how the data are spread across the nodes. Table 2 shows how the 5,000,175 rows of data are distributed.

Node     Number of Partitions   Number of Records
node48   0                      1250044
node49   0                      1250044
node50   0                      1250044
node51   0                      1250043

Table 2: Distribution of 5 Million Observations Across 4 Nodes

Results

Table 3 shows complete time-to-run results, in seconds, using the larger data set of 5 million records. PROC IMSTAT took a total of 12.56 seconds to complete, compared to 109.7 seconds for RRE7. The RRE7 total is the sum of all times reported in Dinsmore and Norton (2014) minus the time for the stepwise linear regression task, because SAS In-Memory Statistics for Hadoop has yet to implement stepwise regression. Thus, RRE7 took 8.73 times as long to run the same set of tasks as PROC IMSTAT. Individual tasks took from 2.8 to 40 times as long to run in RRE7 as in PROC IMSTAT. In all instances, PROC IMSTAT outperformed the RRE7 reported timings across a set of representative tasks spanning the end-to-end analytics life cycle.
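The near-even split in Table 2 follows from simple integer division of the row count by the number of data nodes; a quick Python check reproduces the per-node record counts:

```python
# Split 5,000,175 rows as evenly as possible across 4 data nodes.
# Three nodes receive one extra row, matching Table 2.
rows, nodes = 5_000_175, 4
base, remainder = divmod(rows, nodes)  # base = 1_250_043, remainder = 3
counts = [base + 1] * remainder + [base] * (nodes - remainder)
print(counts)  # [1250044, 1250044, 1250044, 1250043]
```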
Task                                                                            RRE 7   SAS PROC IMSTAT   How Much Faster Is SAS?
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable           1.2     0.03              40x
Median and deciles for 1 numeric variable                                       1.4     0.11              12.72x
Frequency distribution for 1 text variable                                      0.8     0.03              26.7x
Linear regression, 1 numeric response, 20 numeric predictors, with score code   6.8     2.43              2.8x
Linear regression, 1 numeric response, 10 numeric and 10 categorical predictors 7.3     0.55              13.2x
Stepwise linear regression with 100 numeric predictors                          13.9    --                --
Logistic regression, 1 binary response, 20 numeric predictors                   16.9    1.10              15.4x
Generalized linear model, 20 numeric predictors, gamma distribution and link    32.7    5.49              6x
k-means clustering with 20 active variables                                     10.1    0.64              15.8x
k-means clustering with 100 active variables                                    32.5    2.18              14.9x

Table 3: Time to Run (Seconds)

Table 4 provides the overall time to run for both the 1 million and 5 million observation data sets. Using the first linear regression model (with 20 numeric predictors), 50 million observations were scored using PROC IMSTAT in 1.34 seconds. The comparable task in RRE7 took 28.8 seconds, over 21 times as long.

Data Set Size    Total Time for Tasks
1 million rows   4.80
5 million rows   12.56

Table 4: Total Time to Run (Seconds)
Conclusion

This study has attempted to make a benchmark comparison between SAS In-Memory Statistics for Hadoop, a distributed computing environment, and Revolution Analytics' distributed computing environment. Results show that the SAS In-Memory Statistics for Hadoop times to run the reported tasks were all faster than their Revolution Analytics counterparts. These results are in contrast to those reported in the 2014 benchmark by Dinsmore and Norton (2014). One reason for the conflicting results between the two benchmarks is that the Dinsmore and Norton (2014) benchmark used Revolution Analytics' distributed computing environment, PEMA, but contrasted results with (a) SAS High-Performance procedures not run on the SAS High-Performance Analytics Server or (b) non-distributed procedures. This severely limited the comparability of procedures. One limitation of this study is that we were only able to use a proxy for the data set used in the Revolution Analytics benchmark. However, the data sizes (number of rows and columns) in the two studies were identical. A next step may include ensuring that the exact data generated by Revolution Analytics is used. Despite this, we feel that the results provided in this study offer a clearer comparison between the two analytics solutions. If speed matters, as claimed by Dinsmore and Norton (2014), then SAS In-Memory Statistics for Hadoop provides a clear advantage for advanced analytics customers.

We would like to thank the SAS Enterprise Excellence Center and Business Intelligence Research and Development teams for their assistance in securing hardware assets and installing software for the tests performed in this benchmark study.
References

Dinsmore, Thomas, & Norton, Derek (2014). Revolution R Enterprise: Faster than SAS. Available at http://www.revolutionanalytics.com/sites/default/files/revolution-analytics-sas-benchmark-whitepaper-mar2014.pdf.

SAS Institute Inc. (2014). SAS LASR Analytic Server 2.3: Reference Guide. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/pdf/default/inmsref.pdf.

SAS Institute Inc. (2014). IMSTAT Procedure (Analytics). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/html/default/viewer.htm#n1l5k6bed95vzqn1a47vafe3q958.htm.

SAS Institute Inc. (2014). IMSTAT Procedure (Data and Server Management). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/html/default/viewer.htm#p10dosb1fybvpzn1hw38gxuotopk.htm.

Smith, David (2012). Allstate compares SAS, Hadoop and R for Big-Data Insurance Models. Available at http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html.
To contact your local SAS office, please visit: sas.com/offices

Copyright © 2014, SAS Institute Inc. All rights reserved.