Big R: Large-scale Analytics on Hadoop using R


2014 IEEE International Congress on Big Data

Oscar D. Lara, Weiqiang Zhuang, and Adarsh Pannu
IBM Silicon Valley Laboratory, San Jose, CA, USA

Abstract—As the volume of available data continues to rapidly grow from a variety of sources, scalable and performant analytics solutions have become an essential tool to enhance business productivity and revenue. Existing data analysis environments, such as R, are constrained by the size of main memory and cannot scale in many applications. This paper introduces Big R, a new platform which enables accessing, manipulating, analyzing, and visualizing data residing on a Hadoop cluster from the R user interface. Big R is inspired by R semantics and overloads a number of R primitives to support big data. Hence, users are able to quickly prototype big data analytics routines without the need to learn a new programming paradigm. The current Big R implementation works on two main fronts: (1) data exploration, which enables R as a query language for Hadoop, and (2) partitioned execution, which allows the execution of any R function on smaller pieces of a large dataset across the nodes of the cluster.

Keywords—Big data, MapReduce, machine learning.

I. INTRODUCTION

In the past decade, there has been an exceptional increase in the amount of available data, stemming from social networks, mobile sensing data, and multimedia, among other sources. In 2009, the growth rate of the digital universe reached 62%, resulting in 1.2 zettabytes of data (i.e., 1.2 billion terabytes!) [1]. It is estimated that, by 2020, this amount will be 44 times larger, with 80% of the data universe qualifying as unstructured. Therefore, the most essential goal is now the extraction of useful information from these massive amounts of data in an efficient, convenient, and user-friendly manner.
The applications of the emerging area of big data analytics are countless: a financial corporation, for instance, could be interested in automatically detecting fraudulent transactions; likewise, insurance companies are nowadays focused on analyzing high volumes of telemetry data to determine whether a driver's behavior might be prone to traffic accidents; airlines are also willing to analyze which factors may cause flight delays, in order to optimize their scheduling policies.

Hadoop [2], the most widely used implementation of the MapReduce [3] paradigm, came into the picture in 2005 as an effective alternative for tackling big data problems. When the amount of data is extremely large, Hadoop offers a cost-effective solution by providing high-performance computation on a number of inexpensive machines working in parallel. However, the power of Hadoop has not been fully realized in the analytics domain. Current projects, such as Apache Mahout, offer a very limited set of statistical tools and machine learning algorithms. R [4], on the other hand, is one of the most complete and broadly used environments for data analysis and mining. There are at least two million R users around the world, many of them in the academic and corporate sectors [5]. R incorporates a huge arsenal of data mining algorithms, as well as powerful mechanisms for visualization and data manipulation, in more than five thousand available packages [4]. Nevertheless, R also has a big limitation: it relies on in-memory data, which makes it unsuitable for large datasets (i.e., datasets which cannot fit in a single machine's main memory). As an alternative, data scientists usually apply sampling or partitioning to handle large datasets in R. For this purpose, they usually need to write scripts in other languages, such as Perl, Java, or C, to sample or divide the data into pieces small enough to fit in memory. This process can be extremely tedious, slow, and error-prone.
Moreover, sampling and partitioning often reduce model accuracy by discarding potentially valuable data. In this paper, we introduce Big R, a new platform for exploiting the best of both worlds: Hadoop and R. Big R extends a variety of R primitives, classes, operators, and functions to support big data, thereby making R scalable while exploiting the parallelization power inherent to Hadoop. Big R allows users to access, manipulate, visualize, and analyze data on HDFS (in flat files, Hive, or HBase) in a transparent manner. In fact, Big R primitives and semantics are very similar (even identical, in many cases) to R's, facilitating the creation of new routines without the need to learn a new language or paradigm. Big R is intended to work on three main fronts: (1) data exploration, which enables R as a query language for Hadoop; (2) partitioned execution, allowing the execution of any R function on smaller pieces of a large dataset across the nodes of the cluster; and (3) large-scale machine learning, opening the possibility to train and evaluate data mining algorithms on arbitrarily big data. The first two fronts are available in our current implementation, while the third is still under development. The rest of the paper is organized as follows: Section II gives an overview of current work on integrating R and Hadoop. Section III describes the design and main features of Big R. Section IV illustrates the most relevant use cases and exposes

part of the functional specification of the system. Section V covers some performance considerations, and finally, Section VI concludes the paper and summarizes the lessons learned.

II. RELATED WORK

There have been a number of efforts to integrate R and Hadoop; perhaps the most relevant ones are RHadoop [6] and RHIPE [7]. These two solutions, nonetheless, only provide an API to invoke MapReduce from R, requiring users to write code for both mappers and reducers in R. This brings about usability challenges, especially for data scientists who are not expected to have any experience or interest in the internal parallelization of their code, but are rather focused on analyzing their data. In contrast, Big R features a more user-friendly interface by mimicking R primitives, operators, and functions, thereby hiding the underlying complexity of MapReduce programming; to illustrate this, we will take a look at a few examples shortly. Furthermore, Big R leverages the entire spectrum of R capabilities by (1) partitioning the data into R-able pieces and (2) parallelizing the execution in multiple R instances running on a Hadoop cluster. Although RHIPE also follows a partition-based approach, it does not embed grouping capabilities out of the box, which Big R does through groupapply() (see Section IV-C). Throughout this paper, we will use the well-known airline dataset [8] from the United States Department of Transportation. This dataset contains more than 20 years of flight arrival/departure information and has been extensively analyzed in research studies and benchmarks. Now, suppose we are interested in computing the mean departure delay for each airline on a monthly basis. In RHadoop, such a simple query requires the implementation shown in Figure 1.
csvtextinputformat = function(line)
  keyval(NULL, unlist(strsplit(line, "\\,")))

deptdelay = function(input, output) {
  mapreduce(input = input,
            output = output,
            textinputformat = csvtextinputformat,
            map = function(k, fields) {
              # Skip header lines and bad records.
              if (!identical(fields[[1]], "Year") & length(fields) == 29) {
                deptdelay <- fields[[16]]
                # Skip records where departure delay is "NA".
                if (!identical(deptdelay, "NA")) {
                  # fields[9] is carrier, fields[1] is year, fields[2] is month.
                  keyval(c(fields[[9]], fields[[1]], fields[[2]]), deptdelay)
                }
              }
            },
            reduce = function(keysplit, vv) {
              keyval(keysplit[[2]],
                     c(keysplit[[3]], length(vv), keysplit[[1]],
                       mean(as.numeric(vv))))
            })
}

from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))

Figure 1. RHadoop code for calculating the mean departure delay for each airline in each month [6].

An RHIPE user, on the other hand, needs to write the code fragment shown in Figure 2 to execute the same query [7]:

rhinit(TRUE, TRUE)

# Output from map is:
# "CARRIER YEAR MONTH \t DEPARTURE_DELAY"
map <- expression({
  # For each input record, parse out required fields and output a new record:
  extractdeptdelays = function(line) {
    fields <- unlist(strsplit(line, "\\,"))
    # Skip header lines and bad records:
    if (!identical(fields[[1]], "Year") & length(fields) == 29) {
      deptdelay <- fields[[16]]
      # Skip records where departure delay is "NA":
      if (!identical(deptdelay, "NA")) {
        # fields[9] is carrier, fields[1] is year, fields[2] is month:
        rhcollect(paste(fields[[9]], " ", fields[[1]], " ", fields[[2]],
                        sep=""), deptdelay)
      }
    }
  }
  # Process each record in map input:
  lapply(map.values, extractdeptdelays)
})

# Output from reduce is:
# YEAR \t MONTH \t RECORD_COUNT \t AIRLINE \t AVG_DEPT_DELAY
reduce <- expression(
  pre = {
    delays <- numeric(0)
  },
  reduce = {
    # Depending on the size of the input, reduce will be called multiple
    # times for each key, so accumulate intermediate values in the delays
    # vector:
    delays <- c(delays, as.numeric(reduce.values))
  },
  post = {
    # Process all the intermediate values for the key:
    keysplit <- unlist(strsplit(reduce.key, "\\ "))
    count <- length(delays)
    avg <- mean(delays)
    rhcollect(keysplit[[2]],
              paste(keysplit[[3]], count, keysplit[[1]], avg, sep="\t"))
  }
)

inputpath <- "/data/airline/"
outputpath <- "/dept-delay-month"

# Create the job object:
z <- rhmr(map=map, reduce=reduce, ifolder=inputpath, ofolder=outputpath,
          inout=c("text", "text"), jobname="Avg Departure Delay By Month",
          mapred=list(mapred.reduce.tasks=2))
# Run it:
rhex(z)

Figure 2. RHIPE code for calculating the mean departure delay for each airline in each month [7].

In turn, using Big R, this query can be accomplished with only two lines of code:

> air <- bigr.frame(datapath = "airline.csv", datasource = "DEL",
                    na.string = "NA")
> summary(mean(DepDelay) ~ UniqueCarrier + Year + Month, dataset = air)

The reader may notice that Big R tremendously enhances productivity in the big data analytics context. The summary() function provided by Big R supports aggregate operations (e.g., mean()) and grouping columns (e.g., UniqueCarrier, Year, and Month). Internally, Big R parallelizes the execution on Hadoop by means of IBM BigInsights [9], and the result is returned as a data.frame (R's primary data structure for tabular data). We will examine summary() as well as other functions in the upcoming sections.

III. AN OVERVIEW OF BIG R

Big R is an R package, part of the IBM InfoSphere BigInsights [9] platform, IBM's distribution of Hadoop. Big R overloads a number of R functions and operators to access big data on a Hadoop cluster. It is worth highlighting that Big R users only need to write 100% R code, without worrying about the internal MapReduce parallelization mechanism. In addition, Big R is self-contained, enabling users to access, manipulate, visualize, and analyze big data, all from within an R IDE such as RStudio. Figure 3 displays

the high-level Big R architecture. Big R offers an interface to communicate with BigInsights for accessing data residing on HDFS (i.e., flat files, Hive, or HBase). With Big R, the R language becomes a powerful big data query language, able to run data exploration and analytics functions with semantics similar (or even identical) to R's on MapReduce. Furthermore, users can exploit the partitioned execution capabilities to take any R function and run it on smaller chunks of the data in parallel on the cluster. Finally, our framework allows for an easy integration of large-scale machine learning algorithms, which can be used to train and evaluate models on the entire data.

Figure 3. Big R architecture.

A. Classes

Table I summarizes Big R's core classes. Big R mimics some of R's data abstraction classes to provide access to big data sources. Today, R users can easily load a small (i.e., able to fit in main memory) tabular dataset into an object of class data.frame. Big R, analogously, offers bigr.frame as the scalable version of data.frame. Likewise, bigr.vector and bigr.list are inspired by R's vector and list types. bigr.vectors are usually the result of projecting a column of a bigr.frame, whereas bigr.lists are fundamental for partitioned execution. Finally, class bigr.connection enables accessing data on BigInsights from R via JDBC.

Table I
BIG R CORE CLASSES

Class            Description
bigr.frame       A proxy to a tabular dataset on a Hadoop data source. Big R's version of data.frame.
bigr.vector      A proxy to a uni-dimensional dataset on a Hadoop data source. Big R's version of vector.
bigr.list        A proxy to a collection of serialized R objects generated as a result of partitioned execution.
bigr.connection  A handler to connect R with IBM InfoSphere BigInsights.

Since large datasets cannot be fully loaded in R due to main memory limitations, Big R classes are proxies to big data sources, giving users the feeling of R's interactive data access.
Functions on these classes can be used to pull data to the client, yet this is only recommended for bringing in small slices of data, typically summaries or aggregations. A more powerful mode of operation is to push an R function right to the data themselves by means of partitioned execution.

B. Operators

Big R overloads arithmetic, logical, and data manipulation operators so they can be applied to bigr.frames, bigr.vectors, and bigr.lists. Table II summarizes Big R's available operators. Arithmetic, relational, and logical operators are applicable to two bigr.vectors or to a bigr.vector and a primitive R data type (e.g., integer, numeric, logical, or character). Operator [] allows for filtering elements of bigr.vectors or bigr.frames given relational or logical conditions. Operator $ projects a column of a bigr.frame or an element of a bigr.list. Finally, operator [,] combines both filtering and projection capabilities for bigr.frames. In the next section, we will provide examples of how to employ these operators in a big data context.

Table II
BIG R OPERATORS

Operators               Description
+ - * / %% ^            Arithmetic operators applicable among bigr.vectors or between bigr.vectors and R primitive data types.
[] [,] $ $<-            Projection/filtering operators applicable to bigr.frames.
[] $                    Filtering/indexing operators for bigr.lists and bigr.vectors.
& | !                   Logical operators applicable to bigr.vectors.
== != > < >= <= %in%    Relational operators applicable among bigr.vectors or between bigr.vectors and R primitive data types.

C. From R to MapReduce

Turning every R construct directly into MapReduce could be a very challenging task. Thus, Big R relies on Jaql [10], a scripting language for querying and manipulating large datasets on Hadoop. Jaql's most appealing feature is that it internally translates a given query into MapReduce jobs in a transparent fashion.
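As a rough illustration of this lazy-translation idea (sketched in Python rather than R, with names of our own invention — the paper does not publish Big R's internals), a proxy object can overload its operators to build a query expression instead of computing on data, and a filter over the proxy then becomes one more stage of a Jaql-style pipe:

```python
class ProxyColumn:
    """Stands in for a column of a remote dataset: operators return
    new expression strings instead of computing on data."""
    def __init__(self, expr):
        self.expr = expr

    def _binop(self, op, other):
        # Operands may be other proxies or plain literals.
        rhs = other.expr if isinstance(other, ProxyColumn) else repr(other)
        return ProxyColumn(f"({self.expr} {op} {rhs})")

    def __gt__(self, other):
        return self._binop(">", other)

    def __and__(self, other):
        return self._binop("and", other)

delay = ProxyColumn("DepDelay")
dist = ProxyColumn("Distance")
cond = (delay > 10) & (dist > 500)

# A filter over the proxy appends one stage to the pipe chain:
query = f'read(del(location="airline.csv")) -> filter ({cond.expr})'
print(query)
```

No data moves while the expression is assembled; only the final query string is shipped to the server. This is the spirit, not the letter, of Big R's generic-method dispatch.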
Big R takes advantage of this abstraction, overloading R's core operators and functions through R generic methods to generate Jaql queries. Such queries are executed on the BigInsights server via JDBC. Jaql is a functional language featuring a pipe (->) operator, which turns out to be very useful to stream the output of one function as the input of another. Following this paradigm, every Big R operation generates a Jaql function which is appended to the pipe chain. Let us examine a few examples of how to manipulate big data with Jaql. First, in order to read a CSV file on HDFS, Jaql offers the read() and del() functions:

read(del(location="dataset.csv"));

where dataset.csv is a file stored on HDFS. Now, say we need to project the first three columns and select the rows whose fourth column is at least 10:

read(del(location="file5.txt"))
  -> filter ($[3] >= 10)
  -> transform [$[0], $[1], $[2]];

The transform expression projects columns by specifying column identifiers, while filter selects the rows that fulfill the given condition(s). In plain R, projection and selection on a data.frame are handled by operator [,]. Big R overloads this operator to work with bigr.frames, generating the corresponding Jaql query. Jaql incorporates a number of built-in data manipulation constructs which naturally map to many R primitives (i.e., selection/projection, sorting, arithmetic, etc.). In all other cases, we wrote our own Jaql functions to extend R capabilities to big data. The reader may refer to [9], [10] for more details on Jaql.

IV. BIG R USE CASES

As described earlier, Big R features two implemented use cases: data exploration and partitioned execution. Next, let us explore these scenarios in greater detail. The first step to start working with Big R is to establish a connection to BigInsights. This can be easily accomplished with method bigr.connect(), indicating the address of one of the cluster nodes along with the authentication information. bigr.connect() creates a JDBC connection to BigInsights, enabling implicit execution of Jaql queries and updates from the R client:

> library(bigr)
> bigr.connect(host="...", user="biadmin", password="biadmin")

A. Data exploration

Big R enables the use of R as a query language for Hadoop, providing methods to filter, project, sort, join, and recode, just to mention a few. Additionally, Big R has sampling capabilities and offers built-in analytics functions as part of the data exploration.
1) Accessing big data: Once a connection has been made, a bigr.frame (i.e., Big R's version of data.frame) can be created as follows:

> air <- bigr.frame(datasource="DEL",
                    datapath="/user/biadmin/airline.csv",
                    header=TRUE)

where air is an instance of class bigr.frame, pointing to a delimited HDFS file airline.csv located in /user/biadmin. The file has headers (parameter header was set to TRUE), meaning that column names will be automatically picked up from the file. At this point, no data has been loaded in the R client; only a proxy to the airline dataset has been initialized.

2) Data transformation functions: Table III compiles some of the most important Big R data transformation functions. Almost all of them share the exact R semantics, extending their capabilities to large-scale datasets.

Table III
DATA TRANSFORMATION FUNCTIONS

Functions                               Description
head(), tail()                          First or last k elements of a bigr.frame or bigr.vector.
str(), show(), print()                  Visualize a bigr.frame, bigr.vector, or bigr.list.
colnames(), coltypes(),
colnames<-, coltypes<-                  Return/assign column names and types for a bigr.frame.
attach(), with()                        Direct access to the columns of a bigr.frame.
sort(), merge()                         Sort a bigr.frame or bigr.vector; join two bigr.frames.
bigr.persist()                          Export a bigr.frame, bigr.vector, or bigr.list to a persistent data source.
ifelse()                                Recode a bigr.vector.

# Allow accessing air's columns directly
> attach(air)
# Filter non-canceled delayed flights for American and
# Hawaiian Airlines. Project five columns.
> airsubset <- air[(Cancelled == 0) & (DepDelay >= 10) &
                   (UniqueCarrier %in% c("AA", "HA")),
                   c("UniqueCarrier", "Origin", "Dest",
                     "DepDelay", "ArrDelay")]
# Calculate the average between departure
# and arrival delay.
> airsubset$AvgDelay <- (airsubset$ArrDelay + airsubset$DepDelay) / 2
# Recode average delay as High, Medium, or Low.
> airsubset$Delay <- ifelse(airsubset$AvgDelay > 30, "High",
                     ifelse(airsubset$AvgDelay < 20, "Low", "Medium"))
# Sort by airline (ascending) and origin (descending)
> report <- sort(airsubset,
                 list(airsubset$UniqueCarrier, airsubset$Origin),
                 decreasing=c(FALSE, TRUE))
# Show the structure of report.
> str(report)
bigr.frame: 7 variables:
 $ UniqueCarrier: chr "AA" "AA" "AA" "AA" "AA" "AA" ...
 $ Origin       : chr "XNA" "XNA" "XNA" "TVL" "TUS" "TUS" ...
 $ Dest         : chr "ORD" "DFW" "DFW" "SJC" "ORD" "DFW" ...
 $ DepDelay     : int ...
 $ ArrDelay     : int 31 NA ...
 $ AvgDelay     : num 37.5 NA ...
 $ Delay        : chr "High" "Medium" "Low" "Medium" ...

Figure 4. Data transformation functions in action.

Figure 4 shows some of Big R's data transformation functions. Notice that the projection/filtering operators [,] and $ work for bigr.frames exactly as they do for data.frames! Likewise, relational, logical, and arithmetic operators work with bigr.vectors just as they do with vectors. The sort() function, however, does introduce new semantics, as R's sort() is only intended for vectors. Big R's sort() allows sorting a bigr.frame by specifying the list of columns to order by and whether each should be sorted in increasing or decreasing order. Finally, the result can be displayed just as in R by means of function str().

3) Sampling: Operations on very large datasets are usually processed offline, since they tend to take very long to complete. Hence, data scientists first evaluate different algorithms on small data samples before actually modeling

the entire dataset. In this direction, Big R offers sampling capabilities to enable rapid prototyping of big data analytics routines. Sampling methods take a bigr.frame as input and return the corresponding sample(s) as a bigr.frame or a list of bigr.frames. Three sampling methods are supported for a dataset with n rows:

Proportional sampling: for a specified proportion 0 < p < 1, this method returns a sample with approximately np rows. For this purpose, a uniform random number 0 ≤ ε ≤ 1 is generated for each row, and the rows such that ε < p are returned. This approach runs in O(n), after a single pass through the data. As n → ∞ (which is the case for big data), the number of rows in the sample approaches np.

Fixed-size sampling: for a specified size 0 < k < n, this method returns a sample of exactly k rows with confidence 1 − α. For this purpose, a coin is tossed for each row with a fixed probability of success q. In order to guarantee at least k rows, the value of q is calculated from the binomial confidence interval, using the normal approximation:

    q = p + z_{1−α/2} √(p(1 − p)/n)    (1)

where p = k/n. Then, the top k rows are picked. Notice this algorithm tends to generate more rows than needed in order to guarantee at least k of them; therefore, the first rows have slightly higher chances of being chosen. Such bias becomes smaller as the data size n → ∞.

Partitioned sampling: given a vector of proportions p = [p_i] of size r such that Σ_{i=1}^{r} p_i = 1 and p_i > 0, this method divides the original dataset into r disjoint samples with approximately n·p_i rows each. As n → ∞, the number of rows in each partition approaches n·p_i.
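The first two schemes above can be made concrete in a few lines. The following is a minimal sketch (in Python, with helper names of our own; the hard-coded z value assumes α = 0.05), where Eq. (1) supplies the inflated success probability q for the fixed-size case:

```python
import math
import random

def proportional_sample(rows, p, rng):
    # One pass: keep each row independently with probability p (~ n*p rows).
    return [r for r in rows if rng.random() < p]

def fixed_size_sample(rows, k, rng, z=1.96):
    # Eq. (1) with p = k/n and z = z_{1-alpha/2} (1.96 for alpha = 0.05):
    # inflate the per-row success probability so that at least k rows
    # survive with confidence 1 - alpha.
    n = len(rows)
    p = k / n
    q = p + z * math.sqrt(p * (1 - p) / n)
    kept = [r for r in rows if rng.random() < q]
    return kept[:k]  # take the top k of the (likely) surplus

rng = random.Random(42)
data = list(range(100000))
s1 = proportional_sample(data, 0.01, rng)  # roughly 1000 rows
s2 = fixed_size_sample(data, 100, rng)     # k rows, with confidence 1 - alpha
```

Note how taking the top k of the surplus is exactly the source of the mild bias toward early rows mentioned above.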
Some sampling examples are presented below:

# A random 1% sample of the data
> sample1 <- bigr.sample(air, perc=0.01)
# A random partition of the dataset into 70% and 30%
> samples <- bigr.sample(air, perc=c(0.7, 0.3))
# A random sample with exactly 100 rows
> sample2 <- bigr.sample(air, nsamples=100)

sample1 and sample2 are bigr.frames, whereas samples is a list of two non-overlapping bigr.frames with approximately 70% and 30% of the data. By default, samples are stored as temporary files on HDFS. These files are cleaned up by R's garbage collector once the objects are no longer referenced.

4) Analytics functions: Table IV shows Big R's built-in analytics functions applicable to bigr.vectors. Univariate and bivariate statistics functions can be used to compute the mean, standard deviation, quartiles, correlation, and covariance. Unique values and distinct counts can be calculated with the table() function. Additional arithmetic operations, such as absolute value, logarithms, powers, and square roots, among others, are also supported by Big R.

Table IV
ANALYTICS FUNCTIONS

Functions                                                   Description
mean(), sd(), var(), cov(), cor(), quartiles()              Univariate/bivariate statistics.
summary(), min(), max(), range(), sum(), count(), mean()    Aggregate functions; applicable to the entire data or on a group basis via summary().
unique(), table()                                           Distinct values and counts for each value.
abs(), sign(), log(), pow(), sqrt()                         Arithmetic functions.

Furthermore, Big R features the summary() function, which allows calculating aggregate functions on the given columns on a group basis. The next section, which addresses data visualization, relies heavily on summary() to calculate the statistics that serve as input to R's visualization functions. It is worth highlighting that these functions are only a small subset of Big R's analytics capabilities: virtually any R function can be pushed to the server using partitioned execution (see Section IV-C).

B. Data visualization

Figure 5. Conditioned histogram and boxplot calculated by Big R and rendered using the ggplot2 package.

Big R supports two built-in big data plots: histograms and boxplots. These can be applied to an entire dataset (i.e., a bigr.vector), as well as in a conditioned fashion using R's formula notation. In the latter case, one or more grouping columns may be specified. A conditioned histogram consists of a set of histograms (i.e., one per group) displayed on a grid. Figure 5 shows the distribution of the arrival delay (i.e., ArrDelay) for six different airlines. The number of

bins has been set to 20. Observe that American Airlines (AA) and US Airways (US) have a much higher proportion of delayed flights. This is expected, as they manage a higher volume of flights at busier airports. Aloha and Hawaiian Airlines (AQ and HA), on the other hand, exhibit rather low arrival delays. In addition, the plot indicates that the positive region of the arrival delay follows an exponential distribution, which is an expected result as well. In Big R, plotting conditioned histograms turns out to be very simple:

# Remove outliers and select target airlines
> airfiltered <- air[air$ArrDelay < 60 &
                     air$UniqueCarrier %in%
                       c("AQ", "AA", "DH", "HA", "UA", "US")]
# Plot the conditioned histogram
> bigr.histogram(airfiltered$ArrDelay ~ airfiltered$UniqueCarrier,
                 nbins=20)

First, airfiltered is created as a subset of air, after discarding outliers (i.e., flights whose arrival delay is more than one hour) and filtering six specific airlines. Then, the function bigr.histogram() is invoked to plot the histogram. Its first parameter is a formula, in which the left side indicates the target column and the right side specifies the grouping column(s). Additionally, the number of bins can be set by means of argument nbins. If no grouping columns are specified, a single histogram is rendered for the entire dataset. Conditioned box plots are also included in Big R, where each box (i.e., range, quartiles, and mean) corresponds to a group. Using the same airfiltered dataset, a boxplot of the arrival delay for each airline is shown in Figure 5. The plot was generated with method bigr.boxplot(), specifying the target column and the grouping column as a formula:

# Plot the conditioned boxplot
> bigr.boxplot(airfiltered$ArrDelay ~ airfiltered$UniqueCarrier)

Although Big R only includes histograms and boxplots out of the box, the user can leverage the analytics functions to build many other insightful visualizations.
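Conceptually, the server-side work behind a conditioned histogram is a single grouped pass over (group, value) pairs, accumulating one vector of bin counts per group; only those small count vectors travel back to the client for rendering. A toy Python version (our names, not Big R's implementation):

```python
from collections import defaultdict

def grouped_histogram(records, lo, hi, nbins):
    """One pass over (group, value) pairs, accumulating per-group bin
    counts over nbins equal-width bins spanning [lo, hi)."""
    width = (hi - lo) / nbins
    counts = defaultdict(lambda: [0] * nbins)
    for group, value in records:
        if lo <= value < hi:  # values outside [lo, hi) are dropped
            counts[group][int((value - lo) / width)] += 1
    return dict(counts)

# (carrier, ArrDelay) pairs; 20 bins over [-60, 60), as in the example above
recs = [("AA", -5), ("AA", 12), ("AA", 13), ("HA", 2)]
hist = grouped_histogram(recs, -60, 60, 20)
```

In MapReduce terms, the bin index plays the role of the map key (per group) and the count accumulation is the reduce step.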
Function summary() plays a crucial role here, allowing aggregated statistics to be computed as input to the many rendering functions available in R's extensive visualization libraries. As an example, Figure 6 shows the daily flight volume in 2000 through 2002. Observe that dates such as the day before Thanksgiving (Nov 22, 2001) and Independence Day (July 4, 2001), among others, have a higher concentration of flights. Moreover, the reader can easily spot the day with almost no flights: September 12, 2001, the day after the 9/11 attacks. The data required to generate this plot was calculated using a rather simple Big R query:

> summary(count(.) ~ Month + DayofMonth + Year,
          data=air[air$Year %in% c(2000, 2001, 2002) &
                   air$Cancelled == 0, ])

This query counts the number of non-canceled flights in 2000, 2001, and 2002, grouping by day (i.e., DayofMonth), month (i.e., Month), and year (i.e., Year). summary() returns a data.frame with the daily flight volumes, so the raw data never reaches the R client. Some additional pre-processing is needed to construct the heat map shown in Figure 6.

Figure 6. Flight volume in the early 2000s.

Figure 7 displays another example of data visualization powered by Big R. In this case, the busiest flight routes are shown on a map. The Big R query that makes this visualization possible is as follows:

> summary(count(.) ~ Origin + Dest, data=airline)

Observe this is nothing but counting flights while grouping by origin and destination. Notice that larger cities, such as New York, Houston, Atlanta, San Francisco, and Chicago, exhibit the largest flows of flights. The airports then need to be geo-coded, and some additional pre-processing is required to display the routes on the map. The reader may refer to the maps and geosphere packages for more details on the generation of this plot.

Figure 7. Busiest flight routes in the US.

C. Partitioned execution

Previously, we have shown Big R's features as a big data query language and how it can be used to generate insightful visualizations on large datasets. But perhaps the most powerful use case of Big R is partitioned execution. Here, the entire spectrum of R analytics and modeling packages can be shipped to the cluster. Big R's partitioned execution functions are inspired by R's *apply() family; they execute a given function on smaller pieces of

a dataset according to certain split criteria. Partitions may be arranged (1) on a group basis (i.e., groupapply()), indicating one or more grouping columns, or (2) in equally-sized chunks (i.e., rowapply()), given the number of rows in each partition. In order to illustrate Big R's partitioned execution capabilities, let us build regression tree models for the arrival delay of each airline. First, we create both training and testing sets by means of the bigr.sample() function:

> split <- bigr.sample(air, perc=c(0.3, 0.7))
> train <- split[[1]]
> test <- split[[2]]

1) Training models: In order to train the models, we use package rpart, which implements the recursive partitioning and regression tree algorithm. groupapply() requires three arguments: a dataset (i.e., a bigr.frame), a set of grouping columns (e.g., UniqueCarrier), and an R function to invoke. In our case, such a function, called buildmodel(), generates a regression tree on each partition as follows:

buildmodel <- function(df) {
    library(rpart)
    predcols <- c("ArrDelay", "DepDelay", "DepTime",
                  "CRSArrTime", "Distance")
    model <- rpart(ArrDelay ~ ., df[, predcols])
    return(model)
}

where df is the training set, predcols contains the attributes that will be part of the model, and the formula ArrDelay ~ . indicates that the arrival delay is the attribute to be predicted. The expression df[, predcols] projects solely the columns to be included in the model. In order to build all the models, groupapply() is invoked as follows:

> models <- groupapply(data = train,
                       groupingcolumns = train$UniqueCarrier,
                       rfunction = buildmodel)
> class(models)
[1] "bigr.list"

where UniqueCarrier is the grouping column and buildmodel() is the function to be run for each group. Since the models could occupy huge amounts of memory (e.g., they could contain regression residuals), groupapply() stores them on HDFS.
Note that models is a bigr.list object providing access to the regression trees; it contains twenty-nine elements (i.e., one per airline). One or more models can also be brought to the client for analysis/visualization through function bigr.pull(), specifying a grouping column value as a reference (e.g., models$HA to retrieve the model for Hawaiian Airlines):

> modelha <- bigr.pull(models$HA)
> class(modelha)
[1] "rpart"

2) Model scoring: Once the models have been built on the server, they can be used to make predictions. A function that scores the models (i.e., predicts the arrival delay) for each partition (i.e., airline) is presented below:

scoremodels <- function(df, models) {
    library(rpart)
    carrier <- df$UniqueCarrier[1]
    model <- bigr.pull(models[carrier])
    predictions <- predict(model, df)
    return(data.frame(carrier, df$DepDelay, df$ArrDelay, predictions))
}

This function takes two arguments: the testing set df, as a data.frame, and models, a bigr.list which contains the trees as rpart objects. Since the testing set will also be split by airline, we can be assured that all rows in df come from the same carrier. After the carrier is identified, the corresponding model is pulled and the predictions are generated by R's predict() method. Finally, a data.frame with four columns (i.e., airline, actual departure delay, actual arrival delay, and predicted arrival delay) is returned. Notice that function scoremodels() does not run locally on the client but on the cluster, as groupapply() spawns an R instance per group on the cluster. Finally, groupapply() is executed on the testing set (i.e., test), grouping by test$UniqueCarrier and invoking the function scoremodels().
A signature argument is passed at the very end to specify the schema of the resulting dataset containing the predictions, which will be materialized on HDFS and returned as a bigr.frame:

preds <- groupApply(test, test$UniqueCarrier, scoreModels,
                    signature=data.frame(carrier="Carrier",
                                         DepDelay=1.0,
                                         ArrDelay=1.0,
                                         ArrDelayPred=1.0,
                                         stringsAsFactors=F))

Let us explore what is in preds:

> class(preds)
[1] "bigr.frame"
> head(preds, 5)
  carrier DepDelay ArrDelay ArrDelayPred
1      HA
2      HA
3      HA
4      HA
5      HA

We could also be interested in measuring the quality of the predictions by calculating the root mean square deviation (RMSD). This is easily accomplished with Big R's arithmetic functions and operators, exactly as if preds were a data.frame:

> rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))
> print(rmsd)
[1]

In this case, the prediction error was very high, meaning that either the features or the regression algorithm are not the best choice for predicting the arrival delay. Our intention with this exercise, however, was not to build a highly accurate model but rather to illustrate how Big R's partitioned execution becomes useful in big data machine learning applications. The reader may also notice that, just like rpart, any other algorithm could be plugged into groupApply().
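The scoring and RMSD steps above can be reproduced locally on plain data.frames. The sketch below is not Big R: the training/testing values and the single-carrier setup are invented for illustration, but the four-column result frame and the RMSD expression are the same ones used in the paper's example:

```r
# Local sketch (not Big R): per-group scoring and RMSD on toy data.
# All delay values are invented; "HA" is used as the lone carrier.
library(rpart)

train <- data.frame(DepDelay = c(0, 5, 10, 20, 40, 60),
                    ArrDelay = c(2, 6, 12, 25, 45, 70))
test  <- data.frame(DepDelay = c(8, 35),
                    ArrDelay = c(10, 38))

model <- rpart(ArrDelay ~ DepDelay, data = train,
               control = rpart.control(minsplit = 2))

# assemble the same four-column frame that scoreModels() returns:
# carrier, actual departure delay, actual arrival delay, prediction
preds <- data.frame(carrier      = "HA",
                    DepDelay     = test$DepDelay,
                    ArrDelay     = test$ArrDelay,
                    ArrDelayPred = predict(model, test))

# root mean square deviation, exactly as computed on the bigr.frame
rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred)^2) / nrow(preds))
```

Because Big R overloads the arithmetic operators, the rmsd expression is identical whether preds is a local data.frame or a bigr.frame backed by HDFS.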

D. Large-scale machine learning

The reader may observe that, for partitioned execution to be effective, each partition should be small enough to fit in the main memory of the cluster nodes. Nevertheless, many applications may require building models on the entire dataset, which might be larger. For this purpose, we are actively exploring extending Big R to include large-scale machine learning capabilities out of the box. These will take the form of a rich library of functions (e.g., bigr.lm() for linear regression) which will work similarly to their R counterparts while being able to scale to arbitrarily large data.

V. PERFORMANCE

A. Performance evaluation

Since Big R relies on Jaql queries, we refer the reader to [10] for the Jaql performance evaluation. In that paper, scale-up is measured using different workloads. The analysis shows how Jaql scales linearly with up to 40 nodes, with a maximum data size of 1.4 TB.

B. Local processing

Hadoop suffers from a high overhead inherent in job creation. This makes it ineffective for interactive applications, especially when the user needs to experiment with small data samples before jumping into the entire dataset. Big R takes advantage of Jaql's local processing features to provide sub-second response times when working with small datasets. In such cases, no MapReduce jobs are created; instead, all operations on the data are performed locally on one of the nodes of the cluster. The user can specify whether MapReduce should be used by means of argument useMapReduce when creating a bigr.frame:

bigr.frame(dataPath = "homes.csv",
           dataSource = bigr.env$DEL,
           useMapReduce = FALSE)

VI. CONCLUSIONS AND LESSONS LEARNED

This paper has presented Big R, a new framework coupling R and Hadoop to provide large-scale distributed analytics. Compared to other products, Big R offers clear usability advantages since users do not need to deal with MapReduce programming.
Instead, Big R extends R primitives to seamlessly manipulate, visualize, and analyze big data. Moreover, any function from the extensive CRAN package repository can be pushed to the cluster via Big R's partitioned execution. We summarize some of the lessons learned as follows:

- Data scientists and business analysts are hesitant to embrace MapReduce programming, even if the map() and reduce() operations are available in R. Most users are not interested in learning a new paradigm, especially if it involves dealing with code parallelization. In this sense, Big R looks appealing to the R community, since it inherits R primitives for big data analytics.

- R and Hadoop are not natural friends. While Hadoop is scalable, robust, and inexpensive, it inherently exhibits high latency. This brings about user experience challenges, as R users are accustomed to interactive data analytics. Big R features local processing in the case of small data, but experiments with larger datasets, which do require MapReduce, cannot deliver immediate results. As an alternative to MapReduce, we are exploring other technologies such as Apache Spark, which relies on in-memory computation by caching data from HDFS. Different studies conclude that Spark could outperform Hadoop by a factor of up to 100 [11].

- Partitioned execution entails an additional cost for transferring the data from HDFS to R. Thus, simple computations (e.g., aggregate statistics such as the mean) are discouraged in the context of partitioned execution, because the data transfer cost would be much higher than the cost of the computations themselves. Big R's *apply() functions are rather intended for heavy computations such as machine learning algorithms.

- Large-scale machine learning algorithms able to operate on arbitrarily large training and testing datasets are a must in any big data analytics system. Hence, this is part of our ongoing work.
REFERENCES

[1] Global Futures and Foresight, "The Futures Report 2011."
[2] Apache Hadoop.
[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008.
[4] The R Project.
[5] Revolution Analytics, "What is R?," revolution-computing.com/what-r.
[6] Revolution Analytics R Hadoop Tutorial, r-and-hadoop-step-by-step-tutorials.html.
[7] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland, "Large complex data: divide and recombine (D&R) with RHIPE," Stat, vol. 1, no. 1.
[8] Airline dataset, United States Department of Transportation.
[9] IBM InfoSphere BigInsights, software/data/infosphere/biginsights.
[10] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita, "Jaql: a scripting language for large scale semistructured data analysis," PVLDB, vol. 4, no. 12, 2011.
[11] Apache Spark.


Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Trafodion Operational SQL-on-Hadoop

Trafodion Operational SQL-on-Hadoop Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Big Data for Processing Data and Performing Workload Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

More information

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Creating a universe on Hive with Hortonworks HDP 2.0

Creating a universe on Hive with Hortonworks HDP 2.0 Creating a universe on Hive with Hortonworks HDP 2.0 Learn how to create an SAP BusinessObjects Universe on top of Apache Hive 2 using the Hortonworks HDP 2.0 distribution Author(s): Company: Ajay Singh

More information

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk. Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

MapReduce With Columnar Storage

MapReduce With Columnar Storage SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

More information

R at the front end and

R at the front end and Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Zihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

Efficient Processing of XML Documents in Hadoop Map Reduce

Efficient Processing of XML Documents in Hadoop Map Reduce Efficient Processing of Documents in Hadoop Map Reduce Dmitry Vasilenko, Mahesh Kurapati Business Analytics IBM Chicago, USA dvasilen@us.ibm.com, mkurapati@us.ibm.com Abstract has dominated the enterprise

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information