Technical Paper
Performance of SAS In-Memory Statistics for Hadoop: A Benchmark Study
Allison Jennifer Ames, Xiangxiang Meng, Wayne Thompson




Release Information: Content Version 1.0, May 20, 2014

Trademarks and Patents: SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

Contents
Executive Summary
Introduction
Construction of Proxy Data
Benchmark Methods
   Computing Environment
   Benchmark Tasks
Results
Conclusion
References

Executive Summary

A recent benchmark study by Revolution Analytics included claims such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore & Norton, 2014). However, the comparison made in that study was between Revolution R Enterprise's (RRE) Parallel External Memory Algorithms, a distributed process, and SAS procedures that were not run in distributed mode. To make a fairer comparison, this benchmark study ran the same tasks in a distributed analytic environment. That is, we constructed a data set of identical size to the one used in the Revolution Analytics benchmark and ran the same tasks using SAS In-Memory Statistics for Hadoop™ (PROC IMSTAT) on a cluster with the same number of nodes as the hardware used in the Revolution Analytics benchmark. Results indicate:

- With 5 million observations and 134 columns, PROC IMSTAT took a total of 12.56 seconds to complete all tasks. In comparison, RRE7 completed them in 109.7 seconds. Thus, Revolution Analytics' RRE7 took 8.7 times as long as PROC IMSTAT to run the same set of tasks.
- The individual tasks took from 2.8 to 40 times as long to run in RRE7 as with PROC IMSTAT.
- In all instances, PROC IMSTAT outperformed the RRE7 reported timings for both the 1 million and 5 million observation data sets.
- Scoring a 50 million observation data set completed in 1.34 seconds. The comparable task in RRE7 took 21.5 times as long.

Introduction

The context for this study begins at the Strata Conference on October 25, 2012, where the research and planning division of a large insurance corporation presented various methods that they had used to model 150 million observations of insurance data. A summary of their presentation is available at http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html. In this performance benchmark, Revolution Analytics asserted that their Parallel External Memory Algorithms (PEMA) resulted in "vastly better performance for advanced analytics" (Dinsmore & Norton, 2014). However, several readers voiced concern regarding the methodology used, and the validity of the claims made, by Revolution Analytics. These readers pointed out that the Revolution Analytics tests were run on clustered computing environments, but the SAS benchmark tests were not.

In March 2014, Revolution Analytics undertook a follow-up benchmark study to make a fairer comparison by running the tests on the same hardware. The 2014 benchmark included hiring a SAS consultant to review the programs and enable them for grid computing. The findings of this second benchmark included claims such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore & Norton, 2014). However, Dinsmore and Norton (2014) deployed SAS Release 9.4 with Base SAS, SAS/STAT, and SAS Grid Manager as the major components, using a desktop machine running SAS Management Console and SAS Enterprise Guide as the grid client. Despite enabling the grid, SAS procedures running on a single node were compared to distributed Revolution Analytics algorithms. In the one instance in which a distributed SAS procedure was compared (PROC HPREG), the SAS High-Performance Analytics Server was not utilized; in this case, the benefits of the high-performance procedures cannot be fully realized.
While we applaud the attempt to make a fairer comparison between Revolution Analytics and SAS products, and Revolution Analytics' transparency in posting the SAS code used to run the procedures (at https://github.com/RevolutionAnalytics/Benchmark), the benchmark is still not an evaluation using comparable computing environments. The computing environments used in the 2014 Revolution Analytics benchmark remain dramatically different despite the intention to provide a fairer comparison. Dinsmore and Norton (2014) concluded that SAS/STAT software was slower than RRE because of the way SAS/STAT swaps data between memory and disk when a data set is larger than memory, a process that can be slower than in-memory operations. In contrast, RRE uses Parallel External Memory Algorithms (PEMA) to distribute operations over multiple machines in a clustered architecture. When a data set is larger than the memory of any single machine, rather than swap to disk, RRE distributes the data across all available computing resources. This, Dinsmore and Norton (2014) claim, is the reason behind the vastly different timings.

A more fruitful and just comparison can be made by comparing SAS distributed procedures to RRE distributed algorithms. The purpose of this benchmark is to make such a comparison. We generated a data set comparable to the one described in the 2014 Revolution Analytics benchmark and performed a set of tests using the SAS LASR Analytic Server and SAS In-Memory Statistics for Hadoop™. The remainder of the paper discusses the construction of the proxy data, a description of the SAS LASR Analytic Server and SAS In-Memory Statistics for Hadoop™, benchmark procedures, results, and conclusions.

Construction of Proxy Data

Three data sets were generated to mimic the properties of those used in the Dinsmore and Norton (2014) study in terms of row and column counts. The row counts of these data sets are one million, five million, and 50 million, respectively. Each table contains 134 columns. All data generation was performed using the IMSTAT procedure on the SAS LASR Analytic Server.

Benchmark Methods

Computing Environment

The SAS LASR Analytic Server is an in-memory engine designed to address advanced analytics in a scalable manner. It provides secure, multiuser, concurrent access to data of any size and is a dedicated, multipass analytical server. The SAS In-Memory Statistics for Hadoop™ procedure (PROC IMSTAT) moves all of the data into dedicated memory; the main advantage is being able to analyze all of the data in the shortest amount of time. The software is optimized for distributed, multithreaded architectures and scalable processing, so requests to run new scenarios or complex analytical computations are handled very quickly. This benchmark demonstrates just how fast some common analytical procedures can be performed. PROC IMSTAT uses in-memory analytics technology to perform analyses that range from data exploration, visualization, and descriptive statistics to model building with advanced statistical and machine learning algorithms and scoring new data.

Revolution Analytics used a clustered computing environment consisting of five four-core machines running CentOS, networked with Gigabit Ethernet connections, plus a separate NFS server. Revolution R Enterprise Release 7 (RRE7) was installed on each node. To make a valid comparison, all tasks run within PROC IMSTAT on the SAS LASR Analytic Server also used five nodes (one name node and four data nodes).
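The proxy tables themselves were generated with the IMSTAT procedure on the LASR server. For readers who want to experiment without a LASR cluster, a table of the same shape can be sketched in Python. The column names, the numeric-to-categorical split, and the distributions below are our own assumptions for illustration; the paper matches the Revolution Analytics data only in row and column counts.

```python
import numpy as np
import pandas as pd

def make_proxy_table(n_rows, n_numeric=124, n_categorical=10, seed=0):
    """Build a proxy table with n_rows rows and 134 columns.
    Column names, the 124/10 split, and the distributions are
    hypothetical; only the overall shape mirrors the benchmark data."""
    rng = np.random.default_rng(seed)
    data = {f"x{i}": rng.normal(size=n_rows) for i in range(n_numeric)}
    for j in range(n_categorical):
        data[f"c{j}"] = rng.choice(list("ABCDE"), size=n_rows)
    return pd.DataFrame(data)

# Small demonstration; the benchmark tables had 1M, 5M, and 50M rows.
df = make_proxy_table(1000)
print(df.shape)  # (1000, 134)
```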

Benchmark Tasks

The tasks included in the benchmark are listed in Table 1.

Task | RRE 7 Capability | SAS PROC IMSTAT
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable | rxSummary | summary
Median and deciles for 1 numeric variable | rxQuantile | percentile
Frequency distribution for 1 text variable | rxCube | frequency
Linear regression with 1 numeric response and 20 numeric predictors, with score code generated | rxLinMod | glm
Linear regression with 1 numeric response, 10 numeric predictors, and 10 categorical predictors | rxLinMod | glm
Stepwise linear regression with 100 numeric predictors | rxLinMod | --
Logistic regression with 1 binary response variable and 20 numeric predictors | rxLogit | logistic
Generalized linear model with a numeric response variable, 20 numeric predictors, and a gamma distribution and link function | rxGlm | genmodel
k-means clustering with 20 active variables | rxKmeans | cluster
k-means clustering with 100 active variables | rxKmeans | cluster

Table 1: Benchmark Tasks

An example script for computing frequencies in PROC IMSTAT follows. For a more comprehensive discussion of the SAS LASR Analytic Server and SAS In-Memory Statistics for Hadoop™, please see the SAS LASR Analytic Server reference guide and the PROC IMSTAT documentation (SAS Institute Inc., 2014).

proc lasr create port=&myport path="/tmp";
   performance nodes=4;
run;

libname lasr sasiola port=&myport tag='work';

data lasr.data1m;
   set &data1m.;
run;

proc imstat;
   table lasr.organics;
   frequency DemTVReg;
run;
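Each task in Table 1 was timed individually as wall-clock seconds. As a language-neutral illustration of how such per-task timings can be collected (this is not how the SAS or RRE7 timings were actually measured; both products report elapsed time in their own logs), a minimal Python harness:

```python
import time

def time_task(name, fn, *args, **kwargs):
    """Run fn once and report its elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f} s")
    return result, elapsed

# Example: time a toy descriptive-statistics task.
data = list(range(1_000_000))
stats, secs = time_task(
    "descriptive statistics",
    lambda: (min(data), max(data), sum(data) / len(data)),
)
```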

A distributioninfo statement provides information about how the data are spread across the nodes. Table 2 shows how the 5,000,175 rows of data are distributed.

Node | Number of Partitions | Number of Records
node48 | 0 | 1250044
node49 | 0 | 1250044
node50 | 0 | 1250044
node51 | 0 | 1250043

Table 2: Distribution of 5 Million Observations Across 4 Nodes

Results

Table 3 shows the complete time-to-run results, in seconds, using the larger data set of five million records. PROC IMSTAT took a total of 12.56 seconds to complete, compared to 109.7 seconds for RRE7. This time is the sum of all times reported in Dinsmore and Norton (2014), minus the time for the stepwise linear regression task, as SAS In-Memory Statistics for Hadoop™ has yet to implement stepwise regression. Thus, Revolution Analytics' RRE7 took 8.73 times as long as PROC IMSTAT to run the same set of tasks. The individual tasks took from 2.8 to 40 times as long to run in RRE7 as with PROC IMSTAT. In all instances, PROC IMSTAT outperformed the RRE7 reported timings across a set of representative tasks spanning the end-to-end analytics life cycle.
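The near-even split shown in Table 2 matches what simple arithmetic predicts for spreading 5,000,175 rows over four data nodes. A quick check of that arithmetic (an illustration only, not of how the LASR server actually partitions data):

```python
def even_split(n_rows, n_nodes):
    """Spread n_rows over n_nodes as evenly as possible:
    the first (n_rows % n_nodes) nodes get one extra row."""
    base, extra = divmod(n_rows, n_nodes)
    return [base + 1 if i < extra else base for i in range(n_nodes)]

counts = even_split(5_000_175, 4)
print(counts)  # [1250044, 1250044, 1250044, 1250043]
```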

Task | RRE 7 | SAS PROC IMSTAT | How Much Faster Is SAS?
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable | 1.2 | 0.03 | 40x
Median and deciles for 1 numeric variable | 1.4 | 0.11 | 12.72x
Frequency distribution for 1 text variable | 0.8 | 0.03 | 26.7x
Linear regression with 1 numeric response and 20 numeric predictors, with score code generated | 6.8 | 2.43 | 2.8x
Linear regression with 1 numeric response, 10 numeric predictors, and 10 categorical predictors | 7.3 | 0.55 | 13.2x
Stepwise linear regression with 100 numeric predictors | 13.9 | -- | --
Logistic regression with 1 binary response variable and 20 numeric predictors | 16.9 | 1.10 | 15.4x
Generalized linear model with a numeric response variable, 20 numeric predictors, and a gamma distribution and link function | 32.7 | 5.49 | 6x
k-means clustering with 20 active variables | 10.1 | 0.64 | 15.8x
k-means clustering with 100 active variables | 32.5 | 2.18 | 14.9x

Table 3: Time to Run (Seconds)

Table 4 provides the overall time to run for both the 1 million and 5 million observation data sets. Using the first linear regression model (with 20 numeric predictors), 50 million observations were scored with PROC IMSTAT in 1.34 seconds. The comparable task in RRE7 took 28.8 seconds, over 21 times as long.

Data Set Size | Total Time for Tasks
1 million rows | 4.80
5 million rows | 12.56

Table 4: Total Time to Run (Seconds)
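The headline figures quoted above can be reproduced from the per-task timings in Table 3 (stepwise regression excluded, since PROC IMSTAT offers no counterpart):

```python
# Per-task wall-clock seconds from Table 3, stepwise regression excluded.
rre7   = [1.2, 1.4, 0.8, 6.8, 7.3, 16.9, 32.7, 10.1, 32.5]
imstat = [0.03, 0.11, 0.03, 2.43, 0.55, 1.10, 5.49, 0.64, 2.18]

total_rre7 = round(sum(rre7), 1)               # 109.7 s
total_imstat = round(sum(imstat), 2)           # 12.56 s
overall = round(total_rre7 / total_imstat, 2)  # 8.73x overall

# Per-task speedups range from 2.8x to 40x.
ratios = [round(r / s, 1) for r, s in zip(rre7, imstat)]
print(total_rre7, total_imstat, overall)
print(min(ratios), max(ratios))  # 2.8 40.0

# Scoring 50M rows: RRE7 took 28.8 s versus 1.34 s for PROC IMSTAT.
print(round(28.8 / 1.34, 1))  # 21.5
```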

Conclusion

This study has attempted a benchmark comparison between SAS In-Memory Statistics for Hadoop™ and Revolution Analytics' distributed computing environment, with both run as distributed systems. Results show that the SAS In-Memory Statistics for Hadoop™ times to run the reported tasks were all faster than their Revolution Analytics counterparts. These results contrast with those reported by Dinsmore and Norton (2014). One reason for the conflicting results between the two benchmarks is that the Dinsmore and Norton (2014) benchmark used Revolution Analytics' distributed computing environment, PEMA, but contrasted results with (a) SAS High-Performance procedures not run on the SAS High-Performance Analytics Server or (b) non-distributed procedures. This severely limited the comparability of the procedures.

One limitation of this study is that we were only able to use a proxy for the data set used in the Revolution Analytics benchmark. However, the data sizes (numbers of rows and columns) in the two studies were identical. A next step may include ensuring that the exact data generated by Revolution Analytics are used. Despite this, we feel that the results of this study provide a clearer comparison between the two analytics solutions. If speed matters, as claimed by Dinsmore and Norton (2014), then SAS In-Memory Statistics for Hadoop™ provides a clear advantage for advanced analytics customers.

We would like to thank the SAS Enterprise Excellence Center and Business Intelligence Research and Development teams for their assistance in securing hardware assets and installing software for the tests performed in this benchmark study.

References

Dinsmore, Thomas, & Norton, Derek. (2014). Revolution R Enterprise: Faster than SAS. Available at http://www.revolutionanalytics.com/sites/default/files/revolution-analytics-sas-benchmark-whitepaper-mar2014.pdf.

SAS Institute Inc. (2014). SAS LASR Analytic Server 2.3: Reference Guide. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/pdf/default/inmsref.pdf.

SAS Institute Inc. (2014). IMSTAT Procedure (Analytics). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/html/default/viewer.htm#n1l5k6bed95vzqn1a47vafe3q958.htm.

SAS Institute Inc. (2014). IMSTAT Procedure (Data and Server Management). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/html/default/viewer.htm#p10dosb1fybvpzn1hw38gxuotopk.htm.

Smith, David. (2012). Allstate compares SAS, Hadoop and R for Big-Data Insurance Models. Available at http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html.

To contact your local SAS office, please visit sas.com/offices.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2014, SAS Institute Inc. All rights reserved.