RevoScaleR Speed and Scalability



EXECUTIVE WHITE PAPER
By Lee Edlefsen, Ph.D., Chief Scientist, Revolution Analytics

Abstract

RevoScaleR, the Big Data predictive analytics library included with Revolution R Enterprise, is designed from the ground up to be fast and scalable. Consideration has been given to all of the components involved in performing large-scale statistical analysis: data storage, usage of a computing infrastructure's resources (RAM, CPUs, cores, and computers), and the algorithms themselves. Its extreme speed and scalability are the result of careful, innovative engineering at every stage. This white paper describes the design and implementation considerations that are the foundation of the high-performance Big Data capabilities of Revolution R Enterprise.

Executive Summary

Analytics-driven breakthroughs in every field, from healthcare to financial services, have put demand for advanced analytics front and center for large and small organizations. As in any IT deployment, IT leaders supporting analytics environments have been challenged by tradeoffs among cost, performance, and functionality. These tradeoffs are becoming more problematic due to exploding data volumes and the increasing number of people who recognize the potential impact of advanced analytics and are requesting solutions that exceed the capabilities of existing tools. How can IT create an analytics infrastructure that will grow with the organization's needs?

For the past several decades, the rising tide of technology, especially the increasing speed of single processors, has allowed the same data analysis code from legacy analytics software vendors to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM, and legacy code can't keep up. To allow analytics to realize its potential for organizational improvements and to handle very large and growing data sets, IT leaders need scalable data analysis software that is able to run on newer hardware paradigms, specifically using multiple cores, multiple hard drives, and multiple computers. The data analysis software needs to scale from small data sets to huge ones; from using one core and one hard drive on one computer to using many cores and many hard drives on many computers; and from using local hardware to using remote clouds.

Revolution Analytics offers enterprise-grade, terabyte-class software based on the open source project R. This white paper discusses the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR, specifically exploring:

- Storing data
- Reading and writing chunks of data
- Handling data in memory
- Using multiple cores on single computers

Please share this white paper with the people on your team who are responsible for collecting, storing, managing, analyzing, and extracting value from data.

1. Storing Data

One of the keys to being scalable is the ability to process more data than can fit into memory at one time. This essentially equates to being able to work with chunks of data instead of requiring the entire data set to be resident in memory at once. In the context of RevoScaleR, chunks are defined as sequential blocks of rows for a given selection of columns.

Although RevoScaleR can process data from a wide variety of sources, it has its own highly optimized file format (the "XDF" format) that is especially suitable for chunking. Data in an XDF file can be accessed rapidly by row or by column. In addition, blocks of contiguous rows for selected columns can be read sequentially rather than randomly; sequential reads can be tens to hundreds of thousands of times faster than random reads. Furthermore, in an XDF file the time it takes to read a block of rows for a variable is essentially independent of the total number of variables and rows in the file. This means that even in terabyte-sized files, only the data for the variables actually required for an analysis needs to be read and processed, and this may amount to only a few hundred megabytes. The time it takes to do that is essentially the same as if only that data were stored in the file; storing the additional unused data does not add to the processing time.

Data in an XDF file is stored in the same binary format that is used in memory, so no conversion is required when it is brought into memory. To minimize wasted space, data can also be stored in appropriately sized types. For instance, a variable that takes no more than 256 distinct values can be stored in one byte per number, rather than in 8 bytes as is the case with some data analysis programs. Floating point values with a precision of less than 6 or 7 decimal digits, which is commonly the case, can be stored in 4 bytes per number rather than 8. New variables and new rows can be added to an XDF file without rewriting the entire file, so the cost of creating new variables and of adding more observations is greatly reduced.
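To make the storage model concrete, here is a minimal sketch using documented RevoScaleR functions (rxImport, rxGetInfo, and rxReadXdf); the file and variable names are placeholders, not taken from this paper.

```r
library(RevoScaleR)

# Import a delimited text file into a chunked XDF file. rxImport writes
# the data block by block, so the source never has to fit in RAM.
rxImport(inData = "transactions.csv", outFile = "transactions.xdf",
         overwrite = TRUE)

# Inspect the block structure and the per-variable storage types
# (e.g., 1-byte or 4-byte encodings chosen to minimize wasted space).
rxGetInfo("transactions.xdf", getVarInfo = TRUE)

# Read a block of contiguous rows for just two variables; the cost is
# essentially independent of how many other columns the file contains.
chunk <- rxReadXdf("transactions.xdf",
                   varsToKeep = c("amount", "region"),
                   startRow = 1, numRows = 100000)
```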

2. Reading Data

When data is read in chunks, the optimal chunk size depends upon a variety of factors, such as the speed of the disk, the speed of RAM, the number and speed of cores, and the types of computations being done. RevoScaleR therefore allows the size of chunks to vary with conditions.

A bottleneck for data processing is data I/O: reading the data from disk. RevoScaleR dedicates one core to reading data from disk, to avoid disk contention and optimize bandwidth for data I/O. Meanwhile, the remaining cores process the chunk of data read into memory by the previous read. Of course, when it is possible to fit all of the data into memory, RevoScaleR allows that, and then assigns all cores to process that data.

3. Handling Data in Memory

As on disk, use of appropriately sized data types in memory reduces the space required and also reduces the time it takes to move the data around in memory. RevoScaleR minimizes data conversion and copying to save time and memory. In almost all other data-oriented programs, before an array of integers and an array of double precision floating point numbers are added together, the array of integers is first converted and copied into an array of doubles, which takes time and space. In RevoScaleR, that is almost never necessary, regardless of the type of operation and the sizes of the data types; no conversion or copying is done until the values are actually loaded into the CPU.

4. Use of Multiple Cores on a Single Computer

Nearly all computations that involve data in RevoScaleR are automatically "threaded": they use multiple cores on a machine when they are available. This is done efficiently by minimizing the overhead of transferring the computations to multiple threads, by minimizing the amount of data that must be copied, by doing as much work as possible on each thread to amortize the cost of initializing the computations, and by minimizing inter-thread communication and synchronization.

Feeding large chunks of data to each of the multiple cores is important for efficiency. For analytic routines such as descriptive statistics, crosstabs, linear regression, logistic regression, and K-means clustering (in which several variables are typically used), a large chunk of observations, perhaps millions of rows for all of the variables, is read into memory by one core. Simultaneously, the data chunk from the previous read is "virtually" split among the remaining cores for the required processing. The code doing the processing on each core (thread) only needs to know what its assigned task is; no inter-thread communication or synchronization is needed.

As a simple example, consider computing the mean of several variables. Millions of observations of each of those variables might be read by the I/O thread, and each of the other threads is then given a proportionate share of the observations. Each computational thread just needs to compute and store the sum of each of the variables for its share of the observations, and to record how many total observations it used. To get the means for the entire data set, the partial sums and partial observation counts are aggregated, and the grand sums are divided by the total number of observations.

Figure 1: RevoScaleR on a Single Computer

- A RevoScaleR algorithm is provided a data source as input.
- The algorithm loops over the data, reading one block at a time.
- Blocks of data are read by a separate worker thread (Thread 0).
- Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory.
- When all of the data has been processed, a master results object is created from the intermediate results objects.

5. Use of Multiple Computers

A key to efficiently using multiple computers is to minimize the amount of information, including data, that must be communicated among the computers. In RevoScaleR, one of the computers (the master node) controls the computations on all of the other computers. It first sends a message to each compute node telling it where to find the data to use and what types of computations to do. On each compute node, multiple cores are used as described above, to maximize the efficiency of the node. The intermediate results from all cores are aggregated on that node, and only that information is sent back to the master node. The master node monitors the status of the compute nodes, aggregates the overall results sent back by those nodes, and then processes those results to get overall estimates. The final processing often involves compute-intensive operations such as solving large sets of equations.

RevoScaleR allows several options for getting data to the cores on each node, including reading data from a common data server, but it is generally most efficient to have the portion of data needed by each node stored locally.

For iterative algorithms that require many passes through the data, such as logistic regression and K-means clustering, the master node controls the iterations. This is done by repeating the steps described above: each iteration is initialized by a message from the master node, which aggregates the results that come back, computes the next set of estimated parameters, and decides whether the algorithm has converged. If not, another iteration is started.

Figure 2: RevoScaleR on Multiple Computers

- Portions of the data source are made available to each compute node.
- RevoScaleR on the master node assigns a task to each compute node, and the sleeping instance of RevoScaleR on each compute node wakes up.
- RevoScaleR on each compute node independently processes its data and returns its intermediate results to RevoScaleR on the master node.
- RevoScaleR on the master node aggregates all of the intermediate results from each compute node and produces the final result.
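In RevoScaleR itself this chunked, multi-threaded pattern is built into the analysis functions. The sketch below shows the documented rxSummary entry point (the file and variable names are placeholders), with the blocksPerRead argument controlling how many on-disk blocks form each in-memory chunk.

```r
library(RevoScaleR)

# Chunked means and other descriptive statistics: one thread streams
# blocks from the XDF file while the remaining cores reduce each chunk
# to partial sums and counts, which are aggregated into final results.
rxSummary(~ amount + quantity, data = "transactions.xdf",
          blocksPerRead = 5)
```

Only the small partial-results objects cross thread boundaries; the data chunks themselves are never copied between threads.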
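From the user's perspective, this distribution is controlled by RevoScaleR's compute context rather than by changes to the analysis call. In the sketch below, rxSetComputeContext, RxLocalParallel, and rxLinMod are documented RevoScaleR names, but the cluster context and its arguments are illustrative assumptions, since the available context types depend on the Revolution R Enterprise version and platform.

```r
library(RevoScaleR)

# Run locally, using all available cores on this machine.
rxSetComputeContext(RxLocalParallel())
fit_local <- rxLinMod(balance ~ age + income, data = "portion.xdf")

# The same model call on a cluster: the master node messages each
# compute node, each node processes its locally stored data portion,
# and only small intermediate-results objects travel back over the
# network. (The RxHpcServer arguments here are illustrative.)
rxSetComputeContext(RxHpcServer(headNode = "cluster-head.example.com"))
fit_dist <- rxLinMod(balance ~ age + income, data = "portion.xdf")
```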

6. Efficient Parallelization of Statistical and Data Mining Algorithms

RevoScaleR is built upon a platform designed to automatically and efficiently parallelize "external memory" algorithms: the class of algorithms that do not require all data to be in memory at one time. Such algorithms are available for a wide range of statistical and data mining routines. The way in which these algorithms are automatically parallelized is such that, in general, the fastest algorithms per core are also the fastest when parallelized. (This happy situation is not the case for some other types of parallel algorithms.)

Since the burden of worrying about parallelization is removed from the engineers implementing these algorithms, they can focus on getting optimal speed on each core. This involves several things. Most obviously, it involves using fast algorithms and carefully coding them using C++ templates, which can produce very fast code. Other issues are important as well. Categorical data, for instance, is very common in statistical computations, and it is handled in ways that save memory, increase speed, and increase computational precision.

It is often the case in statistical models that the same values are required in different parts of the computation. RevoScaleR has a sophisticated algorithm for pre-analyzing models to detect such duplication, so that the number of computations can be minimized; multiple models can also be analyzed jointly. This algorithm can detect collinearities in models as well, which can lead to wasted computations or even computational failures, and can remove them before doing any computations.
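As a schematic illustration of the external-memory idea, the sketch below (plain R, with hypothetical function names) shows the contract such an algorithm satisfies: an update step that folds one chunk into an intermediate result, a combine step that merges intermediate results from different cores or nodes, and a finalize step that turns the merged result into estimates.

```r
# Hypothetical external-memory algorithm for the mean and variance.
# update_chunk needs only one chunk in memory at a time.
update_chunk <- function(state, chunk) {
  list(sum = state$sum + sum(chunk),
       ssq = state$ssq + sum(chunk^2),
       n   = state$n   + length(chunk))
}

# combine merges two intermediate results; because the merge is
# associative, chunks can be processed in parallel across cores or
# compute nodes and the results folded together in any order.
combine <- function(a, b) {
  list(sum = a$sum + b$sum, ssq = a$ssq + b$ssq, n = a$n + b$n)
}

# finalize converts the merged intermediate result into estimates.
finalize <- function(s) {
  c(mean = s$sum / s$n,
    var  = (s$ssq - s$sum^2 / s$n) / (s$n - 1))
}

# Process two chunks independently, merge, and finalize.
init   <- list(sum = 0, ssq = 0, n = 0)
chunks <- list(rnorm(100000), rnorm(100000))  # stand-ins for disk blocks
states <- lapply(chunks, function(ch) update_chunk(init, ch))
finalize(Reduce(combine, states))
```

An algorithm written against this contract never needs the full data set in memory, and the same per-chunk code runs unchanged whether the chunks are spread across threads on one machine or across the nodes of a cluster.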

Conclusion

RevoScaleR is a library included in Revolution R Enterprise that provides extremely fast statistical analysis on terabyte-class data sets, without needing specialized hardware. Using only a commodity multi-processor computer with modest amounts of RAM, data processing and predictive modeling can easily be performed on data sets with hundreds of millions of rows and hundreds of variables, at speeds suitable for interactive processing. Extending the system to a small cluster of similar computers commensurately reduces processing time. These achievements are the result of the design of the RevoScaleR platform, constructed from the ground up for speed and scalability. Specifically:

- Efficient storage of data on local disk, in the high-performance XDF file format optimized for block reads of data;
- A high-performance strategy for streaming data from disk to memory, optimizing throughput by dedicating one core to I/O while the remaining cores process buffered data;
- Optimized data formats for storing data in memory;
- Parallelized algorithms that exploit multiple cores to perform analytic processing on chunks of data held temporarily in memory;
- The ability to exploit the processing power of multiple nodes in a cluster, to further reduce processing times; and
- An architectural platform for implementing parallel, streaming algorithms that efficiently combine the partial results of optimized algorithms running on multiple cores and multiple machines, to provide fast statistical data analyses on extremely large data sets.

The RevoScaleR library is included with Revolution R Enterprise, available for Windows and Linux systems from Revolution Analytics. For more information, please contact Revolution Analytics at 1-855-GET-REVO (+1 650 646 9545) or at www.revolutionanalytics.com.

About Revolution Analytics

Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. Led by predictive analytics pioneer and SPSS co-founder Norman Nie, the company brings high performance, productivity, and enterprise readiness to open source R, the most powerful statistics language in the world. In the last 10 years, R has exploded in popularity and functionality and has emerged as the data scientist's tool of choice. Today R is used by over 2 million analysts worldwide, in academia and at cutting-edge, analytics-driven companies such as Google, Facebook, and LinkedIn.

To equip R for the demands and requirements of all business environments, Revolution R Enterprise builds on open source R with innovations in big data analysis, integration, and user experience. The company's flagship Revolution R product is available as both a workstation and a server-based offering. Revolution R Enterprise Server is designed to scale and meet the mission-critical production needs of large organizations such as Merck, Bank of America, and Mu Sigma, while Revolution R Workstation offers productivity and development tools for individuals and small teams that need to build applications and analyze data.

Revolution Analytics is committed to fostering the growth of the R community. The company sponsors the Inside-R.org community site and local user groups worldwide, and offers free licenses of Revolution R Enterprise to everyone in academia to broaden adoption by the next generation of data scientists. Revolution Analytics is headquartered in Palo Alto, Calif., and backed by North Bridge Venture Partners and Intel Capital. Please visit us at www.revolutionanalytics.com.