
Zihang Yin

Introduction R is commonly used as an open-source statistical software platform that enables analysts to perform complex statistical analysis with limited computing knowledge. Frequently these analytical methods require data sets that are far too large to analyze in local memory. Our assumption is that each analyst understands R but has only a limited understanding of Hadoop.

Perspectives The R and Hadoop Integrated Programming Environment (RHIPE) is an R package for computing across massive data sets: creating subsets, applying routines to subsets, and producing displays of subsets across a cluster of computers using the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce framework. This is accomplished from within the R environment, using standard R programming idioms. Integrating these methods will drive greater analytical productivity and extend the capabilities of companies.

Approach The native language of Hadoop is Java. Java is not suitable for the rapid development needed in a data analysis environment. Hadoop Streaming bridges this gap: users can write MapReduce programs in other languages (e.g. Python, Ruby, Perl), which are then deployed over the cluster. Hadoop Streaming transfers the input data from Hadoop to the user program and vice versa. However, data analysis from R does not involve the user writing code to be deployed from the command line. The analyst has massive data sitting in the background; she needs to create data, partition the data, and compute summaries or displays. These need to be evaluated from the R environment and the results returned to R, ideally without resorting to the command line.
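For illustration, here is a minimal sketch of what a Hadoop Streaming mapper looks like when written in R rather than Java (a word-count mapper). The script name, input/output paths, and the launch command in the comment are hypothetical; the pattern is simply: read records from stdin, write key/value lines to stdout.

#!/usr/bin/env Rscript
## mapper.R -- a word-count style Streaming mapper (hypothetical example).
## Launched roughly as:
##   hadoop jar hadoop-streaming.jar -input /in -output /out \
##       -mapper mapper.R -reducer aggregate
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (w in strsplit(line, "[[:space:]]+")[[1]]) {
    if (nzchar(w)) cat(w, "\t1\n", sep = "")   # emit key<TAB>value on stdout
  }
}
close(con)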

Solution --- RHIPE RHIPE consists of several functions to interact with the HDFS, e.g. save data sets, read data created by RHIPE MapReduce, and delete files. Compose and launch MapReduce jobs from R using the commands rhmr and rhex. Monitor the status using rhstatus, which returns an R object. Stop jobs using rhkill. Compute side-effect files: the output of parallel computations may include the creation of PDF files, R data sets, CSV files, etc. These are copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.

Solution --- RHIPE Data sets created by RHIPE can be read using other languages such as Java, Perl, Python and C. The serialization format used by RHIPE (converting R objects to binary data) is Google's Protocol Buffers, which is very fast and creates compact representations of R objects, ideal for massive data sets. Data sets created using RHIPE are key-value pairs: a key is mapped to a value, and a MapReduce computation iterates over the key-value pairs in parallel. If a RHIPE job produces unique keys, its output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.

RHIPE FUNCTIONS
rhget - Copy files from the HDFS
rhput - Copy files to the HDFS
rhwrite - Write R data to the HDFS
rhread - Read data from the HDFS into R
rhgetkeys - Read values from map files
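A minimal sketch of how these helpers are typically used from an R session, assuming a working RHIPE installation; the local and HDFS paths are hypothetical and exact argument names may differ between RHIPE versions.

library(Rhipe)
rhput("/home/analyst/airline.csv", "/tmp/airline.csv")  # copy a local file to the HDFS
rhget("/tmp/airline.csv", "/home/analyst/copy.csv")     # copy it back from the HDFS
kv <- lapply(1:100, function(i) list(i, rnorm(5)))      # a list of key-value pairs
rhwrite(kv, "/tmp/kv")                                  # write the R data to the HDFS
back <- rhread("/tmp/kv")                               # read it back into R as key-value pairs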

PACKAGING A JOB FOR MAPREDUCE
rhex - Submit a MapReduce R object to Hadoop
rhmr - Create the MapReduce object
Functions to communicate with Hadoop during MapReduce:
rhcollect - Write data to Hadoop MapReduce
rhstatus - Update the status of the job during runtime
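Putting these packaging functions together, a bare-bones job looks roughly like the sketch below; the input and output folders are hypothetical and the pattern mirrors the fuller examples that follow.

map <- expression({
  ## map.keys and map.values hold the current block of key-value pairs
  lapply(seq_along(map.values), function(i)
    rhcollect(map.keys[[i]], nchar(map.values[[i]])))   # emit the line length for each key
})
job <- rhmr(map = map, inout = c("text", "sequence"),
            ifolder = "/tmp/input", ofolder = "/tmp/lengths")
w <- rhex(job, async = TRUE)    # submit the packaged job to Hadoop
rhstatus(w)                     # query progress; returns an R object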

Setup
Using Eucalyptus, create the Hadoop cluster. The cluster has one master node and one slave node. The Hadoop version compatible with RHIPE is 0.20.2.
Install Google Protocol Buffers (protobuf) for serialization.
Install R:
./configure --enable-R-shlib
make
make check
make install
Install RHIPE as an add-on package.
Create an image on Eucalyptus so the setup does not have to be repeated.

Example 1: How to make a text file of random numbers

make.numbers <- function(N, dest, cols=5, factor=1, local=FALSE){
  ## factor: if equal to 1, exactly N rows; otherwise N*factor rows
  ## cols: how many columns per row
  map <- as.expression(bquote({
    COLS <- .(COLS)
    F <- .(F)
    lapply(map.values, function(r){
      for(i in 1:F){
        f <- runif(COLS)
        rhcollect(NULL, f)
      }
    })
  }, list(COLS=cols, F=factor)))

Example 1 (continued): How to make a text file of random numbers

  library(Rhipe)
  mapred <- list()
  if (local) mapred$mapred.job.tracker <- 'local'
  mapred[['mapred.field.separator']] <- " "
  mapred[['mapred.textoutputformat.usekey']] <- FALSE
  mapred$mapred.reduce.tasks <- 0
  z <- rhmr(map=map, N=N, ofolder=dest, inout=c("lapply","text"), mapred=mapred)
  rhex(z)
}

make.numbers(N=1000, "/tmp/somenumbers", cols=10)
## read them in (don't if N is too large!)
f <- rhread("/tmp/somenumbers/", type="text")

Example 2: How to compute the mean --- Mapper

## We want to compute the mean and sd of each column. For this (and let's
## forget about numerical accuracy), we need the sums and sums of squares of
## the K columns. Using those you can compute the mean and sd.
map <- expression({
  ## K is the number of columns
  ## the number of rows is the length of map.values
  ## map.values is a list of lines
  ## this approach is okay if you want /all/ the columns
  K <- 10
  l <- length(map.values)
  all.lines <- as.numeric(unlist(strsplit(unlist(map.values), "[[:space:]]+")))
  ## one row per input line, K columns (byrow keeps values in their own columns)
  all.lines <- matrix(all.lines, ncol = K, byrow = TRUE)
  sums <- apply(all.lines, 2, sum)                    ## by columns
  sqs  <- apply(all.lines, 2, function(r) sum(r^2))   ## by columns
  sapply(1:K, function(r) rhcollect(r, c(l, sums[r], sqs[r])))
})

Example 2: How to compute the mean --- Reducer

reduce <- expression(
  pre = {
    totals <- c(0, 0, 0)
  },
  reduce = {
    totals <- totals + apply(do.call('rbind', reduce.values), 2, sum)
  },
  post = {
    rhcollect(reduce.key, totals)
  }
)
## the mapred bit is optional, but if you have K columns, why run more reducers?
K <- 10  ## number of columns (must match the mapper)
mr <- list(mapred.reduce.tasks = K)
y <- rhmr(map=map, reduce=reduce, combiner=TRUE, inout=c("text","sequence"),
          ifolder="/tmp/somenumbers", ofolder="/tmp/means", mapred=mr)
w <- rhex(y, async=TRUE)
z <- rhstatus(w, mon.sec=5)
results <- if (z$state == "succeeded") rhread("/tmp/means") else NULL
if (!is.null(results)) {
  results <- cbind(unlist(lapply(results, "[[", 1)),
                   do.call('rbind', lapply(results, "[[", 2)))
  colnames(results) <- c("col.num", "n", "sum", "ssq")
}
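The slide stops at the per-column n, sum, and ssq totals; the final arithmetic to obtain the mean and sd from the results matrix built above would look roughly like this (a sketch, not part of the original slides).

## derive per-column mean and sample sd from n, sum and ssq
means <- results[, "sum"] / results[, "n"]
vars  <- (results[, "ssq"] - results[, "n"] * means^2) / (results[, "n"] - 1)
cbind(col.num = results[, "col.num"], mean = means, sd = sqrt(vars))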

Conclusion In summary, the objective of RHIPE is to let the user focus on thinking about the data. The difficulties in distributing computations and storing data across a cluster are automatically handled by RHIPE and Hadoop.