# Big R: Large-scale Analytics on Hadoop using R


the entire dataset. In this direction, Big R offers sampling capabilities to enable rapid prototyping of big data analytics routines. Sampling methods take a bigr.frame as input and return the corresponding sample(s) as a bigr.frame or a list of bigr.frames. Three sampling methods are supported for a dataset with n rows:

- **Proportional sampling:** for a given proportion 0 < p < 1, this method returns a sample with approximately np rows. A uniform random number 0 ≤ ε ≤ 1 is generated for each row, and the rows with ε < p are returned. This can be computed in O(n), in a single pass over the data. As n → ∞ (which is the case for big data), the number of rows in the sample approaches np.
- **Fixed-size sampling:** for a given size 0 < k < n, this method returns a sample of exactly k rows with confidence 1 − α. A coin is tossed for each row with a fixed probability of success q. In order to guarantee at least k rows, q is computed from the binomial-distribution confidence interval, using the normal approximation:

  `q = p + z_{1−α/2} √( p(1−p)/n )`  (1)

  where p = k/n. The top k rows are then picked. Notice that this algorithm tends to generate more rows than needed in order to guarantee at least k of them; therefore, earlier rows have a slightly higher chance of being chosen. This bias becomes smaller as the data size n → ∞.
- **Partitioned sampling:** given a vector of proportions p = [p_i] of size r such that Σ_{i=1}^{r} p_i = 1 and p_i > 0, this method divides the original dataset into r disjoint samples with approximately np_i rows each. As n → ∞, the number of rows in partition i approaches np_i.
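As an illustration of the three schemes above, here is a minimal single-machine sketch in plain Python (not Big R's distributed implementation); the helper names and the default z = 1.96 (i.e., α = 0.05) are our own:

```python
import math
import random

def proportional_sample(rows, p):
    # Keep each row independently with probability p: one pass, ~n*p rows.
    return [r for r in rows if random.random() < p]

def fixed_size_q(n, k, z=1.96):
    # Success probability q from the binomial confidence interval
    # (normal approximation), inflated so that at least k of the n
    # coin tosses succeed with confidence ~1 - alpha.
    p = k / n
    return p + z * math.sqrt(p * (1 - p) / n)

def fixed_size_sample(rows, k, z=1.96):
    # Toss a coin per row with probability q, then keep the top k rows.
    q = fixed_size_q(len(rows), k, z)
    kept = [r for r in rows if random.random() < q]
    return kept[:k]

def partitioned_sample(rows, props):
    # Split into len(props) disjoint samples; props must sum to 1.
    bounds, acc = [], 0.0
    for p in props:
        acc += p
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound
    parts = [[] for _ in props]
    for r in rows:
        e = random.random()
        parts[next(i for i, b in enumerate(bounds) if e < b)].append(r)
    return parts
```

Note how fixed-size sampling takes the first k of the slightly-too-many kept rows, which is exactly the source of the small front-of-data bias described above.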
Some sampling examples are presented below:

```r
# A random 1% sample of the data
> sample1 <- bigr.sample(air, perc=0.01)
# A random partition of the dataset into 70% and 30%
> samples <- bigr.sample(air, perc=c(0.7, 0.3))
# A random sample with exactly 100 rows
> sample2 <- bigr.sample(air, nsamples=100)
```

sample1 and sample2 are bigr.frames, whereas samples is a list of two non-overlapping bigr.frames with approximately 70% and 30% of the data. By default, samples are stored as temporary files on HDFS. These files are cleaned up by R's garbage collector once the objects are no longer referenced.

4) Analytics functions: Table IV shows Big R's built-in analytics functions applicable to bigr.vectors. Univariate and bivariate statistics functions can be used to compute mean, standard deviation, quartiles, correlation, and covariance. Unique values and distinct counts can also be calculated with the table() function. Additional arithmetic operations, such as absolute value, logarithm, power, and square root, among others, are also supported by Big R.

Table IV. Analytics functions

| Functions | Description |
|---|---|
| mean(), sd(), var(), cov(), cor(), quartiles() | Univariate / bivariate statistics. |
| summary(), min(), max(), range(), sum(), count(), mean() | Aggregate functions. Can be applied to the entire data or on a group basis via summary(). |
| unique(), table() | Distinct values and counts for each value. |
| abs(), sign(), log(), pow(), sqrt() | Arithmetic functions. |

Furthermore, Big R features the summary() function, which computes aggregate functions on the given columns on a group basis. The next section, which addresses data visualization, relies heavily on summary() to calculate the statistics that become the input of R's visualization functions. It is worth highlighting that these functions are only a small subset of Big R's analytics capabilities: virtually any R function can be pushed to the server using partitioned execution (see Section IV-C).

## B. Data visualization

Figure 5.
Conditioned histogram and boxplot calculated by Big R and rendered using the ggplot2 package.

Big R supports two built-in big data plots: histograms and boxplots. These can be applied to an entire dataset (i.e., a bigr.vector), as well as in a conditioned fashion using R's formula notation. In the latter case, one or more grouping columns may be specified. A conditioned histogram plot consists of a set of histograms (i.e., one per group) displayed on a grid. Figure 5 shows the distribution of the arrival delay (i.e., ArrDelay) for six different airlines. The number of bins has been set to 20. Observe that American Airlines (AA) and US Airways (US) have a much higher proportion of delayed flights. This is expected, as they manage a higher volume of flights at busier airports. Aloha and Hawaiian Airlines (AQ and HA), on the other hand, exhibit rather low arrival delays. In addition, the plot indicates that the positive region of the arrival delay follows an exponential distribution, which is an expected result as well. In Big R, plotting conditioned histograms turns out to be very simple:

```r
# Remove outliers and select target airlines
> airFiltered <- air[air$ArrDelay < 60 & air$UniqueCarrier
                     %in% c("AQ", "AA", "DH", "HA", "UA", "US")]
# Plot the conditioned histogram chart
> bigr.histogram(airFiltered$ArrDelay ~ airFiltered$UniqueCarrier, nbins=20)
```

First, airFiltered is created as a subset of air, after discarding outliers (i.e., flights whose arrival delay is more than one hour) and filtering six specific airlines. Then, the function bigr.histogram() is invoked to plot the histogram. Its first parameter is a formula, in which the left side indicates the target column whereas the right side specifies the grouping column(s). Additionally, the number of bins can be set by means of the argument nbins. If no grouping columns are specified, a single histogram is rendered for the entire dataset.

Conditioned box plots are also included in Big R, where each box (i.e., range, quartiles, and mean) corresponds to a group. Using the same airFiltered dataset, a boxplot of the arrival delay for each airline is shown in Figure 5. The plot was generated with the method bigr.boxplot(), specifying the target column and the grouping column as a formula:

```r
# Plot the conditioned boxplot chart
> bigr.boxplot(airFiltered$ArrDelay ~ airFiltered$UniqueCarrier)
```

Although Big R only includes histograms and boxplots out of the box, the user can leverage the analytics functions to build many other insightful visualizations.
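Conceptually, rendering a conditioned histogram only requires per-group bin counts over a shared range, which can be computed in a single scan of the data (plus a pass for the min/max). A sketch of that computation in plain Python (our own illustration, not Big R's implementation):

```python
from collections import defaultdict

def conditioned_histogram(values, groups, nbins=20):
    # Per-group bin counts over a shared [min, max] range, one pass.
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0               # avoid zero width if all values equal
    counts = defaultdict(lambda: [0] * nbins)
    for v, g in zip(values, groups):
        b = min(int((v - lo) / width), nbins - 1)  # clamp the max value into the last bin
        counts[g][b] += 1
    return lo, width, dict(counts)
```

Only the per-group counts (one small vector per airline), never the raw rows, need to leave the scan and reach the plotting layer.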
Function summary() plays a crucial role here, allowing the user to compute aggregated statistics that become the input of the many rendering functions available in R's extensive visualization libraries. As an example, Figure 6 shows the daily flight volume from 2000 through 2002. Observe that dates such as the day before Thanksgiving (Nov 22, 2001) and Independence Day (July 4, 2001), among others, have a higher concentration of flights. Moreover, the reader can easily spot the day with almost no flights: September 12, 2001, the day after the 9/11 attacks. The data required to generate this plot was calculated using a rather simple Big R query:

```r
> summary(count(.) ~ Month + DayofMonth + Year,
          data=air[air$Year %in% c(2000, 2001, 2002)
                   & air$Cancelled == 0, ])
```

This query counts the number of non-canceled flights in 2000, 2001, and 2002, grouping by day (i.e., DayofMonth), month (i.e., Month), and year (i.e., Year). summary() returns a data.frame with the daily flight volumes, so the entire raw data never reaches the R client. Some additional data pre-processing needs to be done to construct the heat map shown in Figure 6.

Figure 6. Flight volume in the early 2000s.

Figure 7 displays another example of data visualizations powered by Big R. In this case, the busiest flight routes are shown on a map. The Big R query that makes this visualization possible is as follows:

```r
summary(count(.) ~ Origin + Dest, data=airline)
```

Observe that this is nothing but counting the flights while grouping by origin and destination. Notice that larger cities such as New York, Houston, Atlanta, San Francisco, and Chicago, among others, exhibit the largest flows of flights. The airports then need to be geo-coded, and some additional pre-processing is required to display the routes on the map. The reader may refer to the packages maps and geosphere for more details on the generation of this plot.

Figure 7. Busiest flight routes in the US.
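The summary(count(.) ~ ...) queries above are, at heart, group-by counts. A minimal Python sketch of that computation (illustrative only; Big R runs it on the cluster, and the column names follow the airline dataset):

```python
from collections import Counter

def grouped_count(rows, keys):
    # Count rows per combination of grouping-column values.
    return Counter(tuple(row[k] for k in keys) for row in rows)

# Daily volume of non-canceled flights, as in the Figure 6 query:
flights = [
    {"Year": 2001, "Month": 11, "DayofMonth": 22, "Cancelled": 0},
    {"Year": 2001, "Month": 11, "DayofMonth": 22, "Cancelled": 0},
    {"Year": 2001, "Month": 9,  "DayofMonth": 12, "Cancelled": 1},
]
kept = [r for r in flights if r["Cancelled"] == 0]
volume = grouped_count(kept, ["Month", "DayofMonth", "Year"])
```

Only the resulting counts (one number per day) travel back to the client, which is what keeps the raw data server-side.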

## C. Partitioned execution

So far, we have shown Big R's features as a big data query language and how it can be used to generate insightful visualizations on large datasets. But perhaps the most powerful use case of Big R is partitioned execution, whereby the entire spectrum of R analytics and modeling packages can be shipped to the cluster. Big R's partitioned execution functions are inspired by R's *apply() family; they execute a given function on smaller pieces of a dataset according to certain split criteria. Partitions may be arranged (1) on a group basis (i.e., groupapply()), indicating one or more grouping columns, or (2) in equally sized chunks (i.e., rowapply()), given the number of rows in each partition.

In order to illustrate Big R's partitioned execution capabilities, let us build regression tree models for the arrival delay of each airline. First, we create training and testing sets by means of the bigr.sample() function:

```r
> split <- bigr.sample(air, perc=c(0.3, 0.7))
> train <- split[[1]]
> test <- split[[2]]
```

1) Training models: In order to train the models, we will use the package rpart, which includes the recursive partitioning and regression tree algorithm. groupapply() requires three arguments: a dataset (i.e., a bigr.frame), a set of grouping columns (e.g., UniqueCarrier), and an R function to invoke. In our case, this function, called buildModel(), generates a regression tree on each partition as follows:

```r
buildModel <- function(df) {
  library(rpart)
  predCols <- c("ArrDelay", "DepDelay", "DepTime",
                "CRSArrTime", "Distance")
  model <- rpart(ArrDelay ~ ., df[, predCols])
  return(model)
}
```

where df is the training set, predCols contains the attributes that will be part of the model, and the formula ArrDelay ~ . indicates that the arrival delay is the attribute to be predicted. The expression df[, predCols] projects only the columns to be included in the model. In order to build all the models, groupapply() is invoked as follows:

```r
> models <- groupapply(data = train,
                       groupingColumns = train$UniqueCarrier,
                       rfunction = buildModel)
> class(models)
[1] "bigr.list"
```

where UniqueCarrier is the grouping column and buildModel() is the function to be run for each group. Since the models could occupy huge amounts of memory (e.g., they could contain regression residuals), groupapply() stores them on HDFS.
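To make the semantics concrete, groupapply() behaves like a per-group function application. A toy single-process analogue in Python (our own sketch, with a mean-delay "model" standing in for an rpart tree):

```python
from collections import defaultdict

def group_apply(rows, key, fn):
    # Partition rows by a grouping column, run fn once per partition,
    # and return {group value: fn result}. Big R does this distributed:
    # one R instance per group, with results persisted on HDFS.
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return {g: fn(grp) for g, grp in parts.items()}

flights = [{"UniqueCarrier": "HA", "ArrDelay": 2},
           {"UniqueCarrier": "HA", "ArrDelay": 4},
           {"UniqueCarrier": "AA", "ArrDelay": 30}]
mean_delay = group_apply(flights, "UniqueCarrier",
                         lambda g: sum(r["ArrDelay"] for r in g) / len(g))
```

Swapping the lambda for a model-training function is exactly the buildModel() pattern: the per-group results are the trained models, keyed by carrier.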
Note that models is a bigr.list object that provides access to the regression trees; it contains twenty-nine elements (i.e., one per airline). One or more models can also be brought to the client for analysis/visualization through the function bigr.pull(), specifying a grouping-column value as a reference (e.g., models$HA to retrieve the model for Hawaiian Airlines):

```r
> modelHA <- bigr.pull(models$HA)
> class(modelHA)
[1] "rpart"
```

2) Model scoring: Once the models have been built on the server, they can be used to make predictions. A function that scores the models (i.e., predicts the arrival delay) for each partition (i.e., airline) is presented below:

```r
scoreModels <- function(df, models) {
  library(rpart)
  carrier <- df$UniqueCarrier[1]
  model <- bigr.pull(models[carrier])
  predictions <- predict(model, df)
  return(data.frame(carrier, df$DepDelay,
                    df$ArrDelay, predictions))
}
```

This function takes two arguments: the testing set df, as a data.frame, and models, a bigr.list which contains the trees as rpart objects. Since the testing set will also be split by airline, we can be assured that all rows in df come from the same carrier. After the carrier is identified, the corresponding model is pulled and the predictions are generated by R's method predict(). Finally, a data.frame with four columns (i.e., airline, actual departure delay, actual arrival delay, and predicted arrival delay) is returned. Notice that the function scoreModels() does not run locally on the client; it runs on the cluster, as groupapply() spawns an R instance per group there. Finally, groupapply() is executed for the testing set (i.e., test), grouping by test$UniqueCarrier, and invoking the function scoreModels().
A signature argument is passed at the very end to specify the schema of the resulting dataset containing the predictions, which will be materialized on HDFS and returned as a bigr.frame:

```r
preds <- groupapply(test, test$UniqueCarrier, scoreModels,
                    signature=data.frame(carrier="Carrier",
                                         DepDelay=1.0, ArrDelay=1.0,
                                         ArrDelayPred=1.0,
                                         stringsAsFactors=F))
```

Let us explore what is in preds:

```r
> class(preds)
[1] "bigr.frame"
> head(preds, 5)
  carrier DepDelay ArrDelay ArrDelayPred
1      HA      ...      ...          ...
2      HA      ...      ...          ...
3      HA      ...      ...          ...
4      HA      ...      ...          ...
5      HA      ...      ...          ...
```

We could also be interested in measuring the quality of the predictions by calculating the root mean square deviation (RMSD). This is easily accomplished with Big R's arithmetic functions and operators, exactly as if preds were a data.frame:

```r
> rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))
> print(rmsd)
[1] ...
```

In this case, the prediction error was very high, meaning that either the features or the regression algorithm is not the best choice for predicting the arrival delay. Our intention with this exercise, however, was not to build a highly accurate model, but rather to illustrate how Big R's partitioned execution becomes useful in big data machine-learning applications. The reader may also notice that, just as with rpart, any other algorithm could be plugged into groupapply().
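For reference, the RMSD computed above can be written out in plain Python (illustrative only; Big R evaluates the same expression over the full bigr.frame on the server):

```python
import math

def rmsd(actual, predicted):
    # Root mean square deviation between paired actual/predicted values.
    assert len(actual) == len(predicted)
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))
```

For instance, actuals [10, 0] against predictions [7, 4] give sqrt((9 + 16) / 2), i.e., roughly 3.54.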


### Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

### IBM InfoSphere BigInsights Enterprise Edition

IBM InfoSphere BigInsights Enterprise Edition Efficiently manage and mine big data for valuable insights Highlights Advanced analytics for structured, semi-structured and unstructured data Professional-grade

### Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

### Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

### What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

### Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

### Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

, 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying

### Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

### Application Development. A Paradigm Shift

Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

### Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL

### Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

### IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based

### Big Data: Using ArcGIS with Apache Hadoop

2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Big Data: Using ArcGIS with Apache Hadoop David Kaiser Erik Hoel Offering 1330 Esri UC2013. Technical Workshop.

### Big Data on Microsoft Platform

Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

### MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

### Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

### Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the