Big R: Large-scale Analytics on Hadoop using R


2014 IEEE International Congress on Big Data

Oscar D. Lara, Weiqiang Zhuang, and Adarsh Pannu
IBM Silicon Valley Laboratory, San Jose, CA, USA

Abstract—As the volume of available data continues to rapidly grow from a variety of sources, scalable and performant analytics solutions have become an essential tool to enhance business productivity and revenue. Existing data analysis environments, such as R, are constrained by the size of main memory and cannot scale in many applications. This paper introduces Big R, a new platform which enables accessing, manipulating, analyzing, and visualizing data residing on a Hadoop cluster from the R user interface. Big R is inspired by R semantics and overloads a number of R primitives to support big data. Hence, users are able to quickly prototype big data analytics routines without the need to learn a new programming paradigm. The current Big R implementation works on two main fronts: (1) data exploration, which enables R as a query language for Hadoop, and (2) partitioned execution, which allows the execution of any R function on smaller pieces of a large dataset across the nodes of the cluster.

Keywords—Big data, MapReduce, machine learning.

I. INTRODUCTION

In the past decade, there has been an exceptional increase in the amount of available data, stemming from social networks, mobile sensing data, and multimedia, among other sources. In 2009, the growth rate of the digital universe reached 62%, resulting in 1.2 zettabytes of data (i.e., 1.2 billion terabytes!) [1]. It is estimated that, by 2020, this amount will be 44 times larger, with 80% of the data universe qualifying as unstructured. Therefore, the most essential goal is now the extraction of useful information from these massive amounts of data in an efficient, convenient, and user-friendly manner.
The applications of the emerging area of big data analytics are countless: a financial corporation, for instance, could be interested in automatically detecting fraudulent transactions; likewise, insurance companies are nowadays focused on analyzing high volumes of telemetry data to determine whether a driver's behavior might be prone to traffic accidents; airlines are also willing to analyze which factors may cause flight delays, in order to optimize their scheduling policies.

Hadoop [2], the most widely used implementation of the MapReduce [3] paradigm, came into the picture in 2005 as an effective alternative for tackling big data problems. When the amount of data is extremely large, Hadoop offers a cost-effective solution by providing high-performance computation on a number of inexpensive machines working in parallel. However, the power of Hadoop has not been fully realized in the analytics domain. Current projects, such as Apache Mahout, offer a very limited set of statistical tools and machine learning algorithms. R [4], on the other hand, is one of the most complete and broadly used environments for data analysis and mining. There are at least two million R users around the world, many of them in the academic and corporate sectors [5]. R incorporates a huge arsenal of data mining algorithms, as well as powerful mechanisms for visualization and data manipulation, in more than five thousand available packages [4]. Nevertheless, R also has a big limitation: it relies on in-memory data, which makes it unsuitable for large datasets (i.e., datasets which cannot fit in a single machine's main memory). As an alternative, data scientists usually apply sampling or partitioning to handle large datasets in R. For this purpose, they usually need to write scripts in other languages, such as Perl, Java, or C, to sample or divide the data into pieces small enough to fit in memory. This process can be extremely tedious, slow, and error-prone.
Moreover, sampling and partitioning often reduce model accuracy by discarding potentially valuable data. In this paper, we introduce Big R, a new platform for exploiting the best of both worlds: Hadoop and R. Big R extends a variety of R primitives, classes, operators, and functions to support big data, thereby making R scalable while exploiting the parallelization power inherent to Hadoop. Big R allows users to access, manipulate, visualize, and analyze data on HDFS (in flat files, Hive, or HBase) in a transparent manner. In fact, Big R primitives and semantics are very similar (even identical, in many cases) to R's, facilitating the creation of new routines without the need to learn a new language or paradigm. Big R is intended to work on three main fronts: (1) data exploration, which enables R as a query language for Hadoop; (2) partitioned execution, allowing the execution of any R function on smaller pieces of a large dataset across the nodes of the cluster; and (3) large-scale machine learning, opening the possibility to train and evaluate data mining algorithms on arbitrarily big data. The first two fronts are available in our current implementation, while the third is still under development. The rest of the paper is organized as follows: Section II gives an overview of current work on integrating R and Hadoop. Section III describes the design and main features of Big R. Section IV illustrates the most relevant use cases and exposes

part of the functional specification of the system. Section V covers some performance considerations, and finally, Section VI concludes the paper and summarizes the lessons learned.

II. RELATED WORK

There have been a number of efforts to integrate R and Hadoop; perhaps the most relevant ones are RHadoop [6] and RHIPE [7]. These two solutions, nonetheless, only provide an API to invoke MapReduce from R, requiring users to write code for both mappers and reducers in R. This brings about usability challenges, especially for data scientists who are not expected to have any experience or interest in the internal parallelization of their code, but are rather focused on analyzing their data. In contrast, Big R features a more user-friendly interface by mimicking R primitives, operators, and functions, thereby hiding the underlying complexity of MapReduce programming; to illustrate this, we will take a look at a few examples shortly. Furthermore, Big R leverages the entire spectrum of R capabilities by (1) partitioning the data into R-able pieces and (2) parallelizing the execution in multiple R instances running on a Hadoop cluster. Although RHIPE also follows a partition-based approach, it does not embed grouping capabilities out of the box, which Big R does through groupapply() (see Section IV-C). Throughout this paper, we will use the well-known airline dataset [8] from the United States Department of Transportation. This dataset contains more than 20 years of flight arrival/departure information and has been extensively analyzed in research studies and benchmarks. Now, suppose we are interested in computing the mean departure delay for each airline on a monthly basis. In RHadoop, such a simple query requires the implementation shown in Figure 1.
csvtextinputformat = function(line)
  keyval(NULL, unlist(strsplit(line, "\\,")))

deptdelay = function(input, output) {
  mapreduce(input = input,
            output = output,
            textinputformat = csvtextinputformat,
            map = function(k, fields) {
              # Skip header lines and bad records.
              if (!identical(fields[[1]], "Year") & length(fields) == 29) {
                deptdelay <- fields[[16]]
                # Skip records where departure delay is "NA".
                if (!identical(deptdelay, "NA")) {
                  # fields[9] is carrier, fields[1] is year, fields[2] is month.
                  keyval(c(fields[[9]], fields[[1]], fields[[2]]), deptdelay)
                }
              }
            },
            reduce = function(keysplit, vv) {
              keyval(keysplit[[2]],
                     c(keysplit[[3]], length(vv), keysplit[[1]],
                       mean(as.numeric(vv))))
            })
}

from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))

Figure 1. RHadoop code for calculating the mean departure delay for each airline in each month [6].

An RHIPE user, on the other hand, needs to write the code fragment shown in Figure 2 to execute the same query [7]:

rhinit(TRUE, TRUE)

# Output from map is:
# "CARRIER YEAR MONTH \t DEPARTURE_DELAY"
map <- expression({
  # For each input record, parse out required fields and output a new record:
  extractdeptdelays = function(line) {
    fields <- unlist(strsplit(line, "\\,"))
    # Skip header lines and bad records:
    if (!identical(fields[[1]], "Year") & length(fields) == 29) {
      deptdelay <- fields[[16]]
      # Skip records where departure delay is "NA":
      if (!identical(deptdelay, "NA")) {
        # fields[9] is carrier, fields[1] is year, fields[2] is month:
        rhcollect(paste(fields[[9]], " ", fields[[1]], " ", fields[[2]],
                        sep=""), deptdelay)
      }
    }
  }
  # Process each record in map input:
  lapply(map.values, extractdeptdelays)
})

# Output from reduce is:
# YEAR \t MONTH \t RECORD_COUNT \t AIRLINE \t AVG_DEPT_DELAY
reduce <- expression(
  pre = {
    delays <- numeric(0)
  },
  reduce = {
    # Depending on the size of the input, reduce will be called multiple
    # times for each key, so accumulate intermediate values in the delays
    # vector:
    delays <- c(delays, as.numeric(reduce.values))
  },
  post = {
    # Process all the intermediate values for the key:
    keysplit <- unlist(strsplit(reduce.key, "\\ "))
    count <- length(delays)
    avg <- mean(delays)
    rhcollect(keysplit[[2]],
              paste(keysplit[[3]], count, keysplit[[1]], avg, sep="\t"))
  }
)

inputpath <- "/data/airline/"
outputpath <- "/dept-delay-month"

# Create the job object:
z <- rhmr(map=map, reduce=reduce, ifolder=inputpath, ofolder=outputpath,
          inout=c("text", "text"), jobname="Avg Departure Delay By Month",
          mapred=list(mapred.reduce.tasks=2))
# Run it:
rhex(z)

Figure 2. RHIPE code for calculating the mean departure delay for each airline in each month [7].

In turn, using Big R, this query can be accomplished with only two lines of code:

> air <- bigr.frame(datapath = "airline.csv", datasource = "DEL",
                    na.string = "NA")
> summary(mean(DepDelay) ~ UniqueCarrier + Year + Month, dataset = air)

The reader may notice that Big R tremendously enhances productivity in the big data analytics context. The summary() function provided by Big R supports aggregate operations (e.g., mean()) and grouping columns (e.g., UniqueCarrier, Year, and Month). Internally, Big R parallelizes the execution on Hadoop by means of IBM BigInsights [9], and the result is returned as a data.frame (R's primary data structure for tabular data). We will examine summary() as well as other functions in the upcoming sections.

III. AN OVERVIEW OF BIG R

Big R is an R package, part of the IBM InfoSphere BigInsights [9] platform, IBM's distribution of Hadoop. Big R overloads a number of R functions and operators to access big data on a Hadoop cluster. It is worth highlighting that Big R users only need to write 100% R code, without worrying about the internal MapReduce parallelization mechanism. In addition, Big R is self-contained, enabling users to access, manipulate, visualize, and analyze big data, all from within an R IDE such as RStudio. Figure 3 displays

the high-level Big R architecture. Big R offers an interface to communicate with BigInsights for accessing data residing on HDFS (i.e., flat files, Hive, or HBase). With Big R, the R language becomes a powerful big data query language, able to run data exploration and analytics functions with semantics similar (or even identical) to R's on MapReduce. Furthermore, users can exploit the partitioned execution capabilities to take any R function and run it on smaller chunks of the data in parallel on the cluster. Finally, our framework allows for an easy integration of large-scale machine learning algorithms, which can be used to train and evaluate models on the entire data.

Figure 3. Big R architecture.

A. Classes

Table I summarizes Big R's core classes. Big R mimics some of R's data abstraction classes to provide access to big data sources. Today, R users can easily load a small (i.e., able to fit in main memory) tabular dataset into an object of class data.frame. Big R, analogously, offers bigr.frame as the scalable version of data.frame. Likewise, bigr.vector and bigr.list are inspired by R's vector and list types. bigr.vectors are usually the result of projecting a column of a bigr.frame, whereas bigr.lists are fundamental for partitioned execution. Finally, class bigr.connection enables accessing data on BigInsights from R via JDBC.

Table I
BIG R CORE CLASSES

Class            Description
bigr.frame       A proxy to a tabular dataset on a Hadoop data source. Big R's version of data.frame.
bigr.vector      A proxy to a uni-dimensional dataset on a Hadoop data source. Big R's version of vector.
bigr.list        A proxy to a collection of serialized R objects generated as a result of partitioned execution.
bigr.connection  A handler to connect R with IBM InfoSphere BigInsights.

Since large datasets cannot be fully loaded in R due to main memory limitations, Big R classes are proxies to big data sources, giving users the feeling of R's interactive data access.
Functions on these classes can be used to pull data to the client, yet this is only recommended for bringing in small slices of data, typically summaries or aggregations. A more powerful mode of operation is to push an R function right to the data themselves by means of partitioned execution.

B. Operators

Big R overloads arithmetic, logical, and data manipulation operators so they can be applied to bigr.frames, bigr.vectors, and bigr.lists. Table II summarizes Big R's available operators. Arithmetic, relational, and logical operators are applicable to two bigr.vectors or to a bigr.vector and a primitive R data type (e.g., integer, numeric, logical, or character). Operator [] allows for filtering elements of bigr.vectors or bigr.frames given relational or logical conditions. Operator $ projects a column of a bigr.frame or an element of a bigr.list. Finally, operator [,] combines both filtering and projection capabilities for bigr.frames. In the next section, we will provide examples of how to employ these operators in a big data context.

Table II
BIG R OPERATORS

Operators               Description
+ - * / %% ^            Arithmetic operators applicable among bigr.vectors or between bigr.vectors and R primitive data types.
[] [,] $ $<-            Projection/filtering operators applicable to bigr.frames.
[] $                    Filtering/indexing operators for bigr.lists and bigr.vectors.
& | !                   Logical operators applicable to bigr.vectors.
== != > < >= <= %in%    Relational operators applicable among bigr.vectors or between bigr.vectors and R primitive data types.

C. From R to MapReduce

Turning every R construct directly into MapReduce could be a very challenging task. Thus, Big R relies on Jaql [10], a scripting language for querying and manipulating large datasets on Hadoop. Jaql's most appealing feature is that it internally translates a given query into MapReduce jobs in a transparent fashion.
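As a rough illustration of this lazy-translation idea (sketched in Python rather than R, with names of our own invention — the paper does not publish Big R's internals), a proxy object can overload its operators to build a query expression instead of computing on data, and a filter over the proxy then becomes one more stage of a Jaql-style pipe:

```python
class ProxyColumn:
    """Stands in for a column of a remote dataset: operators return
    new expression strings instead of computing on data."""
    def __init__(self, expr):
        self.expr = expr

    def _binop(self, op, other):
        # Operands may be other proxies or plain literals.
        rhs = other.expr if isinstance(other, ProxyColumn) else repr(other)
        return ProxyColumn(f"({self.expr} {op} {rhs})")

    def __gt__(self, other):
        return self._binop(">", other)

    def __and__(self, other):
        return self._binop("and", other)

delay = ProxyColumn("DepDelay")
dist = ProxyColumn("Distance")
cond = (delay > 10) & (dist > 500)

# A filter over the proxy appends one stage to the pipe chain:
query = f'read(del(location="airline.csv")) -> filter ({cond.expr})'
print(query)
```

No data moves while the expression is assembled; only the final query string is shipped to the server. This is the spirit, not the letter, of Big R's generic-method dispatch.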
Big R takes advantage of this abstraction, overloading R's core operators and functions through R generic methods to generate Jaql queries. Such queries are executed on the BigInsights server via JDBC. Jaql is a functional language featuring a pipe (->) operator, which turns out to be very useful to stream the output of one function as the input of another. Following this paradigm, every Big R operation generates a Jaql function which is appended to the pipe chain. Let us examine a few examples of how to manipulate big data with Jaql. First, in order to read a CSV file on HDFS, Jaql offers the read() and del() functions:

read(del(location="dataset.csv"));

where dataset.csv is a file stored on HDFS. Now, say we need to project the first three columns and select the rows whose fourth column is at least 10:

read(del(location="file5.txt"))
  -> filter ($[3] >= 10)
  -> transform [$[0], $[1], $[2]];

The transform expression projects columns by specifying column identifiers, while filter selects the rows that fulfill the given condition(s). In plain R, projection and selection on a data.frame are handled by operator [,]. Big R overloads this operator to work with bigr.frames, generating the corresponding Jaql query. Jaql incorporates a number of built-in data manipulation constructs which naturally map to many R primitives (i.e., selection/projection, sorting, arithmetic, etc.). In all other cases, we wrote our own Jaql functions to extend R capabilities to big data. The reader may refer to [9], [10] for more details on Jaql.

IV. BIG R USE CASES

As described earlier, Big R features two implemented use cases: data exploration and partitioned execution. Next, let us explore these scenarios in greater detail. The first step to start working with Big R is to establish a connection to BigInsights. This can be easily accomplished with method bigr.connect(), indicating the address of one of the cluster nodes along with the authentication information. bigr.connect() creates a JDBC connection to BigInsights, enabling implicit execution of Jaql queries and updates from the R client:

> library(bigr)
> bigr.connect(host="...", user="biadmin", password="biadmin")

A. Data exploration

Big R enables the use of R as a query language for Hadoop, providing methods to filter, project, sort, join, and recode, just to mention a few. Additionally, Big R has sampling capabilities and offers built-in analytics functions as part of the data exploration.
1) Accessing big data: Once a connection has been made, a bigr.frame (i.e., Big R's version of data.frame) can be created as follows:

> air <- bigr.frame(datasource="DEL",
                    datapath="/user/biadmin/airline.csv",
                    header=TRUE)

where air is an instance of class bigr.frame, pointing to a delimited HDFS file airline.csv located in /user/biadmin. The file has headers (parameter header was set to TRUE), meaning that column names will be automatically picked up from the file. At this point, no data has been loaded in the R client; only a proxy to the airline dataset has been initialized.

2) Data transformation functions: Table III compiles some of the most important Big R data transformation functions. Almost all of them share the exact R semantics, extending their capabilities to large-scale datasets.

Table III
DATA TRANSFORMATION FUNCTIONS

Functions                               Description
head(), tail()                          First or last k elements of a bigr.frame or bigr.vector.
str(), show(), print()                  Visualize a bigr.frame, bigr.vector, or bigr.list.
colnames(), coltypes(),
colnames<-, coltypes<-                  Return/assign column names and types for a bigr.frame.
attach(), with()                        Direct access to the columns of a bigr.frame.
sort(), merge()                         Sort a bigr.frame or bigr.vector; join two bigr.frames.
bigr.persist()                          Export a bigr.frame, bigr.vector, or bigr.list to a persistent data source.
ifelse()                                Recode a bigr.vector.

# Allow accessing air's columns directly
> attach(air)
# Filter non-canceled delayed flights for American and
# Hawaiian Airlines. Project five columns.
> airsubset <- air[(Cancelled == 0) & (DepDelay >= 10) &
                   (UniqueCarrier %in% c("AA", "HA")),
                   c("UniqueCarrier", "Origin", "Dest",
                     "DepDelay", "ArrDelay")]
# Calculate the average between departure
# and arrival delay.
> airsubset$AvgDelay <- (airsubset$ArrDelay + airsubset$DepDelay) / 2
# Recode average delay as High, Medium, or Low.
> airsubset$Delay <- ifelse(airsubset$AvgDelay > 30, "High",
                     ifelse(airsubset$AvgDelay < 20, "Low", "Medium"))
# Sort by airline (ascending) and origin (descending)
> report <- sort(airsubset,
                 list(airsubset$UniqueCarrier, airsubset$Origin),
                 decreasing=c(FALSE, TRUE))
# Show the structure of report.
> str(report)
bigr.frame: 7 variables:
 $ UniqueCarrier: chr "AA" "AA" "AA" "AA" "AA" "AA" ...
 $ Origin       : chr "XNA" "XNA" "XNA" "TVL" "TUS" "TUS" ...
 $ Dest         : chr "ORD" "DFW" "DFW" "SJC" "ORD" "DFW" ...
 $ DepDelay     : int ...
 $ ArrDelay     : int 31 NA ...
 $ AvgDelay     : num 37.5 NA ...
 $ Delay        : chr "High" "Medium" "Low" "Medium" ...

Figure 4. Data transformation functions in action.

Figure 4 shows some of Big R's data transformation functions. Notice that the projection/filtering operators [,] and $ work for bigr.frames exactly as they do for data.frames! Likewise, relational, logical, and arithmetic operators work with bigr.vectors just as they do with vectors. The sort() function, however, does introduce new semantics, as R's sort() is only intended for vectors. Big R's sort() allows sorting a bigr.frame by specifying the list of columns to order by and whether each should be sorted in increasing or decreasing order. Finally, the result can be displayed just as in R by means of function str().

3) Sampling: Operations on very large datasets are usually processed offline, since they tend to take very long to complete. Hence, data scientists first evaluate different algorithms on small data samples before actually modeling

the entire dataset. In this direction, Big R offers sampling capabilities to enable rapid prototyping of big data analytics routines. Sampling methods take a bigr.frame as input and return the corresponding sample(s) as a bigr.frame or a list of bigr.frames. Three sampling methods are supported for a dataset with n rows:

Proportional sampling: for a specified proportion 0 < p < 1, this method returns a sample with approximately np rows. For this purpose, a uniform random number 0 ≤ ε ≤ 1 is generated for each row, and the rows such that ε < p are returned. This approach runs in O(n), after a single pass through the data. As n → ∞ (which is the case for big data), the number of rows in the sample approaches np.

Fixed-size sampling: for a specified size 0 < k < n, this method returns a sample of exactly k rows with confidence 1 − α. For this purpose, a coin is tossed for each row with a fixed probability of success q. In order to guarantee at least k rows, the value of q is calculated from the binomial confidence interval, using the normal approximation:

    q = p + z_{1−α/2} √(p(1 − p)/n)    (1)

where p = k/n. Then, the top k rows are picked. Notice this algorithm tends to generate more rows than needed in order to guarantee at least k of them; therefore, the first rows have slightly higher chances of being chosen. Such bias becomes smaller as the data size n → ∞.

Partitioned sampling: given a vector of proportions p = [p_i] of size r such that Σ_{i=1}^{r} p_i = 1 and p_i > 0, this method divides the original dataset into r disjoint samples with approximately n·p_i rows each. As n → ∞, the number of rows in each partition approaches n·p_i.
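The first two schemes above can be made concrete in a few lines. The following is a minimal sketch (in Python, with helper names of our own; the hard-coded z value assumes α = 0.05), where Eq. (1) supplies the inflated success probability q for the fixed-size case:

```python
import math
import random

def proportional_sample(rows, p, rng):
    # One pass: keep each row independently with probability p (~ n*p rows).
    return [r for r in rows if rng.random() < p]

def fixed_size_sample(rows, k, rng, z=1.96):
    # Eq. (1) with p = k/n and z = z_{1-alpha/2} (1.96 for alpha = 0.05):
    # inflate the per-row success probability so that at least k rows
    # survive with confidence 1 - alpha.
    n = len(rows)
    p = k / n
    q = p + z * math.sqrt(p * (1 - p) / n)
    kept = [r for r in rows if rng.random() < q]
    return kept[:k]  # take the top k of the (likely) surplus

rng = random.Random(42)
data = list(range(100000))
s1 = proportional_sample(data, 0.01, rng)  # roughly 1000 rows
s2 = fixed_size_sample(data, 100, rng)     # k rows, with confidence 1 - alpha
```

Note how taking the top k of the surplus is exactly the source of the mild bias toward early rows mentioned above.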
Some sampling examples are presented below:

# A random 1% sample of the data
> sample1 <- bigr.sample(air, perc=0.01)
# A random partition of the dataset into 70% and 30%
> samples <- bigr.sample(air, perc=c(0.7, 0.3))
# A random sample with exactly 100 rows
> sample2 <- bigr.sample(air, nsamples=100)

sample1 and sample2 are bigr.frames, whereas samples is a list of two non-overlapping bigr.frames with approximately 70% and 30% of the data. By default, samples are stored as temporary files on HDFS. These files are cleaned up by R's garbage collector once the objects are no longer referenced.

4) Analytics functions: Table IV shows Big R's built-in analytics functions applicable to bigr.vectors. Univariate and bivariate statistics functions can be used to compute the mean, standard deviation, quartiles, correlation, and covariance. Unique values and distinct counts can be calculated with the table() function. Additional arithmetic operations, such as absolute value, logarithms, powers, and square roots, among others, are also supported by Big R.

Table IV
ANALYTICS FUNCTIONS

Functions                                                   Description
mean(), sd(), var(), cov(), cor(), quartiles()              Univariate/bivariate statistics.
summary(), min(), max(), range(), sum(), count(), mean()    Aggregate functions; applicable to the entire data or on a group basis via summary().
unique(), table()                                           Distinct values and counts for each value.
abs(), sign(), log(), pow(), sqrt()                         Arithmetic functions.

Furthermore, Big R features the summary() function, which allows calculating aggregate functions on the given columns on a group basis. The next section, which addresses data visualization, relies heavily on summary() to calculate the statistics that serve as input to R's visualization functions. It is worth highlighting that these functions are only a small subset of Big R's analytics capabilities: virtually any R function can be pushed to the server using partitioned execution (see Section IV-C).

B. Data visualization

Figure 5. Conditioned histogram and boxplot calculated by Big R and rendered using the ggplot2 package.

Big R supports two built-in big data plots: histograms and boxplots. These can be applied to an entire dataset (i.e., a bigr.vector), as well as in a conditioned fashion using R's formula notation. In the latter case, one or more grouping columns may be specified. A conditioned histogram consists of a set of histograms (i.e., one per group) displayed on a grid. Figure 5 shows the distribution of the arrival delay (i.e., ArrDelay) for six different airlines. The number of

bins has been set to 20. Observe that American Airlines (AA) and US Airways (US) have a much higher proportion of delayed flights. This is expected, as they manage a higher volume of flights at busier airports. Aloha and Hawaiian Airlines (AQ and HA), on the other hand, exhibit rather low arrival delays. In addition, the plot indicates that the positive region of the arrival delay follows an exponential distribution, which is an expected result as well. In Big R, plotting conditioned histograms turns out to be very simple:

# Remove outliers and select target airlines
> airfiltered <- air[air$ArrDelay < 60 &
                     air$UniqueCarrier %in%
                       c("AQ", "AA", "DH", "HA", "UA", "US")]
# Plot the conditioned histogram
> bigr.histogram(airfiltered$ArrDelay ~ airfiltered$UniqueCarrier,
                 nbins=20)

First, airfiltered is created as a subset of air, after discarding outliers (i.e., flights whose arrival delay is more than one hour) and filtering six specific airlines. Then, the function bigr.histogram() is invoked to plot the histogram. Its first parameter is a formula, in which the left side indicates the target column and the right side specifies the grouping column(s). Additionally, the number of bins can be set by means of argument nbins. If no grouping columns are specified, a single histogram is rendered for the entire dataset. Conditioned box plots are also included in Big R, where each box (i.e., range, quartiles, and mean) corresponds to a group. Using the same airfiltered dataset, a boxplot of the arrival delay for each airline is shown in Figure 5. The plot was generated with method bigr.boxplot(), specifying the target column and the grouping column as a formula:

# Plot the conditioned boxplot
> bigr.boxplot(airfiltered$ArrDelay ~ airfiltered$UniqueCarrier)

Although Big R only includes histograms and boxplots out of the box, the user can leverage the analytics functions to build many other insightful visualizations.
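Conceptually, the server-side work behind a conditioned histogram is a single grouped pass over (group, value) pairs, accumulating one vector of bin counts per group; only those small count vectors travel back to the client for rendering. A toy Python version (our names, not Big R's implementation):

```python
from collections import defaultdict

def grouped_histogram(records, lo, hi, nbins):
    """One pass over (group, value) pairs, accumulating per-group bin
    counts over nbins equal-width bins spanning [lo, hi)."""
    width = (hi - lo) / nbins
    counts = defaultdict(lambda: [0] * nbins)
    for group, value in records:
        if lo <= value < hi:  # values outside [lo, hi) are dropped
            counts[group][int((value - lo) / width)] += 1
    return dict(counts)

# (carrier, ArrDelay) pairs; 20 bins over [-60, 60), as in the example above
recs = [("AA", -5), ("AA", 12), ("AA", 13), ("HA", 2)]
hist = grouped_histogram(recs, -60, 60, 20)
```

In MapReduce terms, the bin index plays the role of the map key (per group) and the count accumulation is the reduce step.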
Function summary() plays a crucial role here, allowing aggregated statistics to be computed as input to the many rendering functions available in R's extensive visualization libraries. As an example, Figure 6 shows the daily flight volume in 2000 through 2002. Observe that dates such as the day before Thanksgiving (Nov 22, 2001) and Independence Day (July 4, 2001), among others, have a higher concentration of flights. Moreover, the reader can easily spot the day with almost no flights: September 12, 2001, the day after the 9/11 attacks. The data required to generate this plot was calculated using a rather simple Big R query:

> summary(count(.) ~ Month + DayofMonth + Year,
          data=air[air$Year %in% c(2000, 2001, 2002) &
                   air$Cancelled == 0, ])

This query counts the number of non-canceled flights in 2000, 2001, and 2002, grouping by day (i.e., DayofMonth), month (i.e., Month), and year (i.e., Year). summary() returns a data.frame with the daily flight volumes, so the raw data never reaches the R client. Some additional pre-processing is needed to construct the heat map shown in Figure 6.

Figure 6. Flight volume in the early 2000s.

Figure 7 displays another example of data visualization powered by Big R. In this case, the busiest flight routes are shown on a map. The Big R query that makes this visualization possible is as follows:

> summary(count(.) ~ Origin + Dest, data=airline)

Observe this is nothing but counting flights while grouping by origin and destination. Notice that larger cities, such as New York, Houston, Atlanta, San Francisco, and Chicago, exhibit the largest flows of flights. The airports then need to be geo-coded, and some additional pre-processing is required to display the routes on the map. The reader may refer to the maps and geosphere packages for more details on the generation of this plot.

Figure 7. Busiest flight routes in the US.

C. Partitioned execution

Previously, we have shown Big R's features as a big data query language and how it can be used to generate insightful visualizations on large datasets. But perhaps the most powerful use case of Big R is partitioned execution. Here, the entire spectrum of R analytics and modeling packages can be shipped to the cluster. Big R's partitioned execution functions are inspired by R's *apply() family; they execute a given function on smaller pieces of

a dataset according to certain split criteria. Partitions may be arranged (1) on a group basis (i.e., groupapply()), indicating one or more grouping columns, or (2) in equally-sized chunks (i.e., rowapply()), given the number of rows in each partition. In order to illustrate Big R's partitioned execution capabilities, let us build regression tree models for the arrival delay of each airline. First, we create both training and testing sets by means of the bigr.sample() function:

> split <- bigr.sample(air, perc=c(0.3, 0.7))
> train <- split[[1]]
> test <- split[[2]]

1) Training models: In order to train the models, we use package rpart, which implements the recursive partitioning and regression tree algorithm. groupapply() requires three arguments: a dataset (i.e., a bigr.frame), a set of grouping columns (e.g., UniqueCarrier), and an R function to invoke. In our case, such a function, called buildmodel(), generates a regression tree on each partition as follows:

buildmodel <- function(df) {
    library(rpart)
    predcols <- c("ArrDelay", "DepDelay", "DepTime",
                  "CRSArrTime", "Distance")
    model <- rpart(ArrDelay ~ ., df[, predcols])
    return(model)
}

where df is the training set, predcols contains the attributes that will be part of the model, and the formula ArrDelay ~ . indicates that the arrival delay is the attribute to be predicted. The expression df[, predcols] projects solely the columns to be included in the model. In order to build all the models, groupapply() is invoked as follows:

> models <- groupapply(data = train,
                       groupingcolumns = train$UniqueCarrier,
                       rfunction = buildmodel)
> class(models)
[1] "bigr.list"

where UniqueCarrier is the grouping column and buildmodel() is the function to be run for each group. Since the models could occupy huge amounts of memory (e.g., they could contain regression residuals), groupapply() stores them on HDFS.
Note that models is a bigr.list object providing access to the regression trees; it contains twenty-nine elements (i.e., one per airline). One or more models can also be brought to the client for analysis/visualization through function bigr.pull(), specifying a grouping column value as a reference (e.g., models$HA to retrieve the model for Hawaiian Airlines):

> modelha <- bigr.pull(models$HA)
> class(modelha)
[1] "rpart"

2) Model scoring: Once the models have been built on the server, they can be used to make predictions. A function that scores the models (i.e., predicts the arrival delay) for each partition (i.e., airline) is presented below:

scoremodels <- function(df, models) {
    library(rpart)
    carrier <- df$UniqueCarrier[1]
    model <- bigr.pull(models[carrier])
    predictions <- predict(model, df)
    return(data.frame(carrier, df$DepDelay, df$ArrDelay, predictions))
}

This function takes two arguments: the testing set df, as a data.frame, and models, a bigr.list which contains the trees as rpart objects. Since the testing set will also be split by airline, we can be assured that all rows in df come from the same carrier. After the carrier is identified, the corresponding model is pulled and the predictions are generated by R's predict() method. Finally, a data.frame with four columns (i.e., airline, actual departure delay, actual arrival delay, and predicted arrival delay) is returned. Notice that function scoremodels() does not run locally on the client but on the cluster, as groupapply() spawns an R instance per group on the cluster. Finally, groupapply() is executed on the testing set (i.e., test), grouping by test$UniqueCarrier and invoking the function scoremodels().
A signature argument is passed at the very end to specify the schema of the resulting dataset containing the predictions, which will be materialized on HDFS and returned as a bigr.frame:

preds <- groupApply(test, test$UniqueCarrier, scoreModels,
                    signature=data.frame(carrier="Carrier",
                                         DepDelay=1.0,
                                         ArrDelay=1.0,
                                         ArrDelayPred=1.0,
                                         stringsAsFactors=F))

Let us explore what is in preds:

> class(preds)
[1] "bigr.frame"
> head(preds, 5)
  carrier DepDelay ArrDelay ArrDelayPred
1      HA
2      HA
3      HA
4      HA
5      HA

We could also be interested in measuring the quality of the predictions by calculating the root mean square deviation (RMSD). This is easily accomplished with Big R's arithmetic functions and operators, exactly as if preds were a data.frame:

> rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))
> print(rmsd)
[1]

In this case, the prediction error was very high, meaning that either the features or the regression algorithm are not the best choice for predicting the arrival delay. Our intention with this exercise, however, was not to build a highly accurate model but rather to illustrate how Big R's partitioned execution becomes useful in big data machine learning applications. The reader may also notice that, just like rpart, any other algorithm could be plugged into groupApply().
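The scoring and RMSD steps above can be reproduced locally on plain data.frames. The sketch below is not Big R: the training/testing values and the single-carrier setup are invented for illustration, but the four-column result frame and the RMSD expression are the same ones used in the paper's example:

```r
# Local sketch (not Big R): per-group scoring and RMSD on toy data.
# All delay values are invented; "HA" is used as the lone carrier.
library(rpart)

train <- data.frame(DepDelay = c(0, 5, 10, 20, 40, 60),
                    ArrDelay = c(2, 6, 12, 25, 45, 70))
test  <- data.frame(DepDelay = c(8, 35),
                    ArrDelay = c(10, 38))

model <- rpart(ArrDelay ~ DepDelay, data = train,
               control = rpart.control(minsplit = 2))

# assemble the same four-column frame that scoreModels() returns:
# carrier, actual departure delay, actual arrival delay, prediction
preds <- data.frame(carrier      = "HA",
                    DepDelay     = test$DepDelay,
                    ArrDelay     = test$ArrDelay,
                    ArrDelayPred = predict(model, test))

# root mean square deviation, exactly as computed on the bigr.frame
rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred)^2) / nrow(preds))
```

Because Big R overloads the arithmetic operators, the rmsd expression is identical whether preds is a local data.frame or a bigr.frame backed by HDFS.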

D. Large-scale machine learning

The reader may observe that, for partitioned execution to be effective, each partition should be small enough to fit in the main memory of the cluster nodes. Nevertheless, many applications may require building models on the entire dataset, which might be larger. For this purpose, we are actively exploring extending Big R to include large-scale machine learning capabilities out of the box. These will take the form of a rich library of functions (e.g., bigr.lm() for linear regression) which will work similarly to their R counterparts while being able to scale to arbitrarily large data.

V. PERFORMANCE

A. Performance evaluation

Since Big R relies on Jaql queries, we refer the reader to [10] for the Jaql performance evaluation. In that paper, scale-up is measured using different workloads. The analysis shows how Jaql scales linearly with up to 40 nodes, with a maximum data size of 1.4 TB.

B. Local processing

Hadoop suffers from a high overhead inherent in job creation. This makes it ineffective for interactive applications, especially when the user needs to experiment with small data samples before jumping into the entire dataset. Big R takes advantage of Jaql's local processing features to provide sub-second response times when working with small datasets. In such cases, no MapReduce jobs are created; instead, all operations on the data are performed locally on one of the nodes of the cluster. The user can specify whether MapReduce should be used by means of argument useMapReduce when creating a bigr.frame:

bigr.frame(dataPath = "homes.csv",
           dataSource = bigr.env$DEL,
           useMapReduce = FALSE)

VI. CONCLUSIONS AND LESSONS LEARNED

This paper has presented Big R, a new framework coupling R and Hadoop to provide large-scale distributed analytics. Compared to other products, Big R offers clear usability advantages since users do not need to deal with MapReduce programming.
Instead, Big R extends R primitives to seamlessly manipulate, visualize, and analyze big data. Moreover, any function from the extensive CRAN package repository can be pushed to the cluster via Big R's partitioned execution. We summarize some of the lessons learned as follows:

- Data scientists and business analysts are hesitant to embrace MapReduce programming, even if the map() and reduce() operations are available in R. Most users are not interested in learning a new paradigm, especially if it involves dealing with code parallelization. In this sense, Big R looks appealing to the R community, since it inherits R primitives for big data analytics.

- R and Hadoop are not natural friends. While Hadoop is scalable, robust, and inexpensive, it inherently exhibits high latency. This brings about user experience challenges, as R users are accustomed to interactive data analytics. Big R features local processing in the case of small data, but experiments with larger datasets, which do require MapReduce, cannot deliver immediate results. As an alternative to MapReduce, we are exploring other technologies such as Apache Spark, which relies on in-memory computation by caching data from HDFS. Different studies conclude that Spark could outperform Hadoop by a factor of up to 100 [11].

- Partitioned execution entails an additional cost for transferring the data from HDFS to R. Thus, simple computations (e.g., aggregate statistics such as the mean) are discouraged in the context of partitioned execution, because the data transfer cost would be much higher than the cost of the computations themselves. Big R's *apply() functions are rather intended for heavy computations such as machine learning algorithms.

- Large-scale machine learning algorithms able to operate on arbitrarily large training and testing datasets are a must in any big data analytics system. Hence, this is part of our ongoing work.
REFERENCES

[1] Global Futures and Foresight, "The Futures Report 2011."
[2] Apache Hadoop.
[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008.
[4] The R Project.
[5] Revolution Analytics, "What is R?," revolution-computing.com/what-r.
[6] Revolution Analytics R Hadoop Tutorial, r-and-hadoop-step-by-step-tutorials.html.
[7] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland, "Large complex data: divide and recombine (D&R) with RHIPE," Stat, vol. 1, no. 1.
[8] Airline dataset, United States Department of Transportation.
[9] IBM InfoSphere BigInsights, software/data/infosphere/biginsights.
[10] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita, "Jaql: a scripting language for large scale semistructured data analysis," PVLDB, vol. 4, no. 12, 2011.
[11] Apache Spark.


Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Trafodion Operational SQL-on-Hadoop

Trafodion Operational SQL-on-Hadoop Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Big Data for Processing Data and Performing Workload Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

More information

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Creating a universe on Hive with Hortonworks HDP 2.0

Creating a universe on Hive with Hortonworks HDP 2.0 Creating a universe on Hive with Hortonworks HDP 2.0 Learn how to create an SAP BusinessObjects Universe on top of Apache Hive 2 using the Hortonworks HDP 2.0 distribution Author(s): Company: Ajay Singh

More information

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk. Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

MapReduce With Columnar Storage

MapReduce With Columnar Storage SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

More information

R at the front end and

R at the front end and Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Zihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

Efficient Processing of XML Documents in Hadoop Map Reduce

Efficient Processing of XML Documents in Hadoop Map Reduce Efficient Processing of Documents in Hadoop Map Reduce Dmitry Vasilenko, Mahesh Kurapati Business Analytics IBM Chicago, USA dvasilen@us.ibm.com, mkurapati@us.ibm.com Abstract has dominated the enterprise

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information