RevoScaleR Speed and Scalability



EXECUTIVE WHITE PAPER
By Lee Edlefsen, Ph.D., Chief Scientist, Revolution Analytics

Abstract

RevoScaleR, the Big Data predictive analytics library included with Revolution R Enterprise, is designed from the ground up to be fast and scalable. Consideration has been given to all of the components involved in performing large-scale statistical analysis: data storage, usage of a computing infrastructure's resources (RAM, CPUs, cores, and computers), and the algorithms themselves. Its extreme speed and scalability are the result of careful, innovative engineering at every stage. This white paper describes the design and implementation considerations that are the foundation of the high-performance Big Data capabilities of Revolution R Enterprise.

Executive Summary

Analytics-driven breakthroughs in every field, from healthcare to financial services, have put demand for advanced analytics front and center for large and small organizations. As in any IT deployment, IT leaders supporting analytics environments have been challenged by tradeoffs among cost, performance, and functionality. These tradeoffs are becoming more problematic due to exploding data volumes and the increasing number of people who recognize the potential impact of advanced analytics and are requesting solutions that exceed the capabilities of existing tools. How can IT create an analytics infrastructure that will grow with the organization's needs?

For the past several decades, the rising tide of technology, especially the increasing speed of single processors, has allowed the same data analysis code from legacy analytics software vendors to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM, and legacy code can't keep up. To allow analytics to realize its potential for organizational improvements and to handle very large and growing data sets, IT leaders need scalable data analysis software that is able to run on newer hardware paradigms, specifically using multiple cores, multiple hard drives, and multiple computers. The data analysis software needs to scale from small data sets to huge ones; from using one core and one hard drive on one computer to using many cores and many hard drives on many computers; and from using local hardware to using remote clouds.

Revolution Analytics offers enterprise-grade, terabyte-class software based on the open source project R. This white paper discusses the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR, specifically exploring:

- Storing data
- Reading and writing chunks of data
- Handling data in memory
- Using multiple cores on single computers

Please share this white paper with the people on your team who are responsible for collecting, storing, managing, analyzing, and extracting value from data.

1. Storing Data

One of the keys to being scalable is the ability to process more data than can fit into memory at one time. This essentially equates to being able to work with chunks of data instead of requiring the entire data set to be resident in memory at once. In the context of RevoScaleR, chunks are defined as sequential blocks of rows for a given selection of columns.

Although RevoScaleR can process data from a wide variety of sources, it has its own highly optimized file format (the "XDF" format) that is especially suitable for chunking. Data in an XDF file can be accessed rapidly by row or by column. In addition, blocks of contiguous rows for selected columns can be read sequentially rather than randomly; sequential reads can be tens to hundreds of thousands of times faster than random reads. Furthermore, in an XDF file the time it takes to read a block of rows for a variable is essentially independent of the total number of variables and rows in the file. This means that even in terabyte-sized files, only the data for the variables actually required for an analysis needs to be read and processed, and this may amount to only a few hundred megabytes. The time it takes to do that is essentially the same as if only that data were stored in the file; storing the additional unused data does not add to the processing time.

Data in an XDF file is stored in the same binary format that is used in memory, so no conversion is required when it is brought into memory. To minimize wasted space, data can also be stored in appropriately sized types. For instance, a variable that takes no more than 256 distinct values can be stored in one byte per number, rather than in 8 bytes as is the case with some data analysis programs. Floating point values with a precision of less than 6 or 7 decimal digits, which is commonly the case, can be stored in 4 bytes per number rather than 8. New variables and new rows can be added to an XDF file without rewriting the entire file, so the cost of creating new variables and of adding more observations is greatly reduced.
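To make the storage model concrete, here is a minimal sketch using documented RevoScaleR functions (rxImport, rxGetInfo, and rxReadXdf); the file and variable names are placeholders, not taken from this paper.

```r
library(RevoScaleR)

# Import a delimited text file into a chunked XDF file. rxImport writes
# the data block by block, so the source never has to fit in RAM.
rxImport(inData = "transactions.csv", outFile = "transactions.xdf",
         overwrite = TRUE)

# Inspect the block structure and the per-variable storage types
# (e.g., 1-byte or 4-byte encodings chosen to minimize wasted space).
rxGetInfo("transactions.xdf", getVarInfo = TRUE)

# Read a block of contiguous rows for just two variables; the cost is
# essentially independent of how many other columns the file contains.
chunk <- rxReadXdf("transactions.xdf",
                   varsToKeep = c("amount", "region"),
                   startRow = 1, numRows = 100000)
```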

2. Reading Data

When data is read in chunks, the optimal chunk size depends upon a variety of factors, such as the speed of the disk, the speed of RAM, the number and speed of cores, and the types of computations being done. RevoScaleR therefore allows the size of chunks to vary with conditions.

A bottleneck for data processing is data I/O: reading the data from disk. RevoScaleR dedicates one core to reading data from disk, to avoid disk contention and optimize bandwidth for data I/O. Meanwhile, the remaining cores process the chunk of data read into memory by the previous read. Of course, when it is possible to fit all of the data into memory, RevoScaleR allows that, and then assigns all cores to process that data.

3. Handling Data in Memory

As on disk, use of appropriately sized data types in memory reduces the space required and also reduces the time it takes to move the data around in memory. RevoScaleR minimizes data conversion and copying to save time and memory. In almost all other data-oriented programs, before an array of integers and an array of double precision floating point numbers are added together, the array of integers is first converted and copied into an array of doubles, which takes time and space. In RevoScaleR, that is almost never necessary, regardless of the type of operation and the sizes of the data types; no conversion or copying is done until the values are actually loaded into the CPU.

4. Use of Multiple Cores on a Single Computer

Nearly all computations that involve data in RevoScaleR are automatically "threaded": they use multiple cores on a machine when they are available. This is done efficiently by minimizing the overhead of transferring the computations to multiple threads, by minimizing the amount of data that must be copied, by doing as much work as possible on each thread to amortize the cost of initializing the computations, and by minimizing inter-thread communication and synchronization.

Feeding large chunks of data to each of the multiple cores is important for efficiency. For analytic routines such as descriptive statistics, crosstabs, linear regression, logistic regression, and K-means clustering (in which several variables are typically used), a large chunk of observations, perhaps millions of rows for all of the variables, is read into memory by one core. Simultaneously, the data chunk from the previous read is "virtually" split among the remaining cores for the required processing. The code doing the processing on each core (thread) only needs to know what its assigned task is; no inter-thread communication or synchronization is needed.

As a simple example, consider computing the mean of several variables. Millions of observations of each of those variables might be read by the I/O thread, and each of the other threads is then given a proportionate share of the observations. Each computational thread just needs to compute and store the sum of each of the variables for its share of the observations, and to record how many total observations it used. To get the means for the entire data set, the partial sums and partial observation counts are aggregated, and the grand sums are divided by the total number of observations.

Figure 1: RevoScaleR on a Single Computer

- A RevoScaleR algorithm is provided a data source as input.
- The algorithm loops over the data, reading one block at a time.
- Blocks of data are read by a separate worker thread (Thread 0).
- Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory.
- When all of the data has been processed, a master results object is created from the intermediate results objects.

5. Use of Multiple Computers

A key to efficiently using multiple computers is to minimize the amount of information, including data, that must be communicated among the computers. In RevoScaleR, one of the computers (the master node) controls the computations on all of the other computers. It first sends a message to each compute node telling it where to find the data to use and what types of computations to do. On each compute node, multiple cores are used as described above, to maximize the efficiency of the node. The intermediate results from all cores are aggregated on that node, and only that information is sent back to the master node. The master node monitors the status of the compute nodes, aggregates the overall results sent back by those nodes, and then processes those results to get overall estimates. The final processing often involves compute-intensive operations such as solving large sets of equations.

RevoScaleR allows several options for getting data to the cores on each node, including reading data from a common data server, but it is generally most efficient to have the portion of data needed by each node stored locally.

For iterative algorithms that require many passes through the data, such as logistic regression and K-means clustering, the master node controls the iterations. This is done by repeating the steps described above: each iteration is initialized by a message from the master node, which aggregates the results that come back, computes the next set of estimated parameters, and decides whether the algorithm has converged. If not, another iteration is started.

Figure 2: RevoScaleR on Multiple Computers

- Portions of the data source are made available to each compute node.
- RevoScaleR on the master node assigns a task to each compute node, and the sleeping instance of RevoScaleR on each compute node wakes up.
- RevoScaleR on each compute node independently processes its data and returns its intermediate results to RevoScaleR on the master node.
- RevoScaleR on the master node aggregates all of the intermediate results from each compute node and produces the final result.
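In RevoScaleR itself this chunked, multi-threaded pattern is built into the analysis functions. The sketch below shows the documented rxSummary entry point (the file and variable names are placeholders), with the blocksPerRead argument controlling how many on-disk blocks form each in-memory chunk.

```r
library(RevoScaleR)

# Chunked means and other descriptive statistics: one thread streams
# blocks from the XDF file while the remaining cores reduce each chunk
# to partial sums and counts, which are aggregated into final results.
rxSummary(~ amount + quantity, data = "transactions.xdf",
          blocksPerRead = 5)
```

Only the small partial-results objects cross thread boundaries; the data chunks themselves are never copied between threads.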
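From the user's perspective, this distribution is controlled by RevoScaleR's compute context rather than by changes to the analysis call. In the sketch below, rxSetComputeContext, RxLocalParallel, and rxLinMod are documented RevoScaleR names, but the cluster context and its arguments are illustrative assumptions, since the available context types depend on the Revolution R Enterprise version and platform.

```r
library(RevoScaleR)

# Run locally, using all available cores on this machine.
rxSetComputeContext(RxLocalParallel())
fit_local <- rxLinMod(balance ~ age + income, data = "portion.xdf")

# The same model call on a cluster: the master node messages each
# compute node, each node processes its locally stored data portion,
# and only small intermediate-results objects travel back over the
# network. (The RxHpcServer arguments here are illustrative.)
rxSetComputeContext(RxHpcServer(headNode = "cluster-head.example.com"))
fit_dist <- rxLinMod(balance ~ age + income, data = "portion.xdf")
```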

6. Efficient Parallelization of Statistical and Data Mining Algorithms

RevoScaleR is built upon a platform designed to automatically and efficiently parallelize "external memory" algorithms: the class of algorithms that do not require all data to be in memory at one time. Such algorithms are available for a wide range of statistical and data mining routines. The way in which these algorithms are automatically parallelized is such that, in general, the fastest algorithms per core are also the fastest when parallelized. (This happy situation is not the case for some other types of parallel algorithms.)

Since the burden of worrying about parallelization is removed from the engineers implementing these algorithms, they can focus on getting optimal speed on each core. This involves several things. Most obviously, it involves using fast algorithms and carefully coding them using C++ templates, which can produce very fast code. Other issues are important as well. Categorical data, for instance, is very common in statistical computations, and it is handled in ways that save memory, increase speed, and increase computational precision.

It is often the case in statistical models that the same values are required in different parts of the computation. RevoScaleR has a sophisticated algorithm for pre-analyzing models to detect such duplication, so that the number of computations can be minimized; multiple models can also be analyzed jointly. This algorithm can detect collinearities in models as well, which can lead to wasted computations or even computational failures, and can remove them before doing any computations.
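As a schematic illustration of the external-memory idea, the sketch below (plain R, with hypothetical function names) shows the contract such an algorithm satisfies: an update step that folds one chunk into an intermediate result, a combine step that merges intermediate results from different cores or nodes, and a finalize step that turns the merged result into estimates.

```r
# Hypothetical external-memory algorithm for the mean and variance.
# update_chunk needs only one chunk in memory at a time.
update_chunk <- function(state, chunk) {
  list(sum = state$sum + sum(chunk),
       ssq = state$ssq + sum(chunk^2),
       n   = state$n   + length(chunk))
}

# combine merges two intermediate results; because the merge is
# associative, chunks can be processed in parallel across cores or
# compute nodes and the results folded together in any order.
combine <- function(a, b) {
  list(sum = a$sum + b$sum, ssq = a$ssq + b$ssq, n = a$n + b$n)
}

# finalize converts the merged intermediate result into estimates.
finalize <- function(s) {
  c(mean = s$sum / s$n,
    var  = (s$ssq - s$sum^2 / s$n) / (s$n - 1))
}

# Process two chunks independently, merge, and finalize.
init   <- list(sum = 0, ssq = 0, n = 0)
chunks <- list(rnorm(100000), rnorm(100000))  # stand-ins for disk blocks
states <- lapply(chunks, function(ch) update_chunk(init, ch))
finalize(Reduce(combine, states))
```

An algorithm written against this contract never needs the full data set in memory, and the same per-chunk code runs unchanged whether the chunks are spread across threads on one machine or across the nodes of a cluster.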

Conclusion

RevoScaleR is a library included in Revolution R Enterprise that provides extremely fast statistical analysis on terabyte-class data sets, without needing specialized hardware. Using only a commodity multi-processor computer with modest amounts of RAM, data processing and predictive modeling can easily be performed on data sets with hundreds of millions of rows and hundreds of variables, at speeds suitable for interactive processing. Extending the system to a small cluster of similar computers commensurately reduces processing time. These achievements are the result of the design of the RevoScaleR platform, constructed from the ground up for speed and scalability. Specifically:

- Efficient storage of data on local disk, in the high-performance XDF file format optimized for block reads of data;
- A high-performance strategy for streaming data from disk to memory, optimizing throughput by dedicating one core to I/O while the remaining cores process buffered data;
- Optimized data formats for storing data in memory;
- Parallelized algorithms that exploit multiple cores to perform analytic processing on chunks of data held temporarily in memory;
- The ability to exploit the processing power of multiple nodes in a cluster, to further reduce processing times; and
- An architectural platform for implementing parallel, streaming algorithms that efficiently combine the partial results of optimized algorithms running on multiple cores and multiple machines, to provide fast statistical data analyses on extremely large data sets.

The RevoScaleR library is included with Revolution R Enterprise, available for Windows and Linux systems from Revolution Analytics. For more information, please contact Revolution Analytics at 1-855-GET-REVO (+1 650 646 9545) or at www.revolutionanalytics.com.

About Revolution Analytics

Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. Led by predictive analytics pioneer and SPSS co-founder Norman Nie, the company brings high performance, productivity, and enterprise readiness to open source R, the most powerful statistics language in the world. In the last 10 years, R has exploded in popularity and functionality and has emerged as the data scientist's tool of choice. Today R is used by over 2 million analysts worldwide, in academia and at cutting-edge, analytics-driven companies such as Google, Facebook, and LinkedIn.

To equip R for the demands and requirements of all business environments, Revolution R Enterprise builds on open source R with innovations in big data analysis, integration, and user experience. The company's flagship Revolution R product is available as both a workstation and a server-based offering. Revolution R Enterprise Server is designed to scale and meet the mission-critical production needs of large organizations such as Merck, Bank of America, and Mu Sigma, while Revolution R Workstation offers productivity and development tools for individuals and small teams that need to build applications and analyze data.

Revolution Analytics is committed to fostering the growth of the R community. The company sponsors the Inside-R.org community site and local user groups worldwide, and offers free licenses of Revolution R Enterprise to everyone in academia to broaden adoption by the next generation of data scientists. Revolution Analytics is headquartered in Palo Alto, Calif., and backed by North Bridge Venture Partners and Intel Capital. Please visit us at www.revolutionanalytics.com.