INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

Similar documents

How To Integrate Hadoop With R And Data Analysis With Big Data With Hadooper

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Case Study : 3 different hadoop cluster deployments

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Map Reduce & Hadoop Recommended Text:

Large scale processing using Hadoop. Ján Vaňo

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop Big Data for Processing Data and Performing Workload

RHadoop Installation Guide for Red Hat Enterprise Linux

Testing 3Vs (Volume, Variety and Velocity) of Big Data

A Brief Outline on Bigdata Hadoop

Big Data and Data Science: Behind the Buzz Words

MapReduce. Tushar B. Kute,

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Workshop on Hadoop with Big Data

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

BIG DATA ANALYSIS USING RHADOOP

Parallel Options for R

How To Scale Out Of A Nosql Database

Distributed Data Analysis with Hadoop and R Jonathan Seidman and Ramesh Venkataramaiah, Ph. D.

Hadoop IST 734 SS CHUNG

Chase Wu New Jersey Ins0tute of Technology

HDFS. Hadoop Distributed File System

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

How To Create A Data Visualization With Apache Spark And Zeppelin

Open source Google-style large scale data analysis with Hadoop

BIG DATA TRENDS AND TECHNOLOGIES

ITG Software Engineering

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Big Data Explained. An introduction to Big Data Science.

Hadoop Architecture. Part 1

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Chapter 7. Using Hadoop Cluster and MapReduce

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Implement Hadoop jobs to extract business value from large and varied data sets

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Introduction to MapReduce and Hadoop

RHadoop and MapR. Accessing Enterprise- Grade Hadoop from R. Version 2.0 (14.March.2014)

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop. Sunday, November 25, 12

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

CSE-E5430 Scalable Cloud Computing Lecture 2

Click Stream Data Analysis Using Hadoop

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

HADOOP MOCK TEST HADOOP MOCK TEST II

Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee

Big Data on Cloud Computing- Security Issues

Cloudera Certified Developer for Apache Hadoop

Big Data on Microsoft Platform

Hadoop Ecosystem B Y R A H I M A.

How Companies are! Using Spark

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Reduction of Data at Namenode in HDFS using harballing Technique

Big Data Analytics OverOnline Transactional Data Set

Big Data and Apache Hadoop s MapReduce

Big Data for the JVM developer. Costin Leau,

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Internals of Hadoop Application Framework and Distributed File System

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Introduc)on to RHadoop Master s Degree in Informa1cs Engineering Master s Programme in ICT Innova1on: Data Science (EIT ICT Labs Master School)

Data-Intensive Computing with Map-Reduce and Hadoop

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Big Data: Tools and Technologies in Big Data

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

MapReduce, Hadoop and Amazon AWS

Alternatives to HIVE SQL in Hadoop File Structure

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

How To Use Hadoop

Microsoft SQL Server 2012 with Hadoop

BIG DATA What it is and how to use?

Comprehensive Analytics on the Hortonworks Data Platform

Hadoop Technology for Flow Analysis of the Internet Traffic

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

COURSE CONTENT Big Data and Hadoop Training

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Enabling High performance Big Data platform with RDMA

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Transcription:

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS Bogdan Oancea "Nicolae Titulescu" University of Bucharest Raluca Mariana Dragoescu The Bucharest University of Economic Studies,

BIG DATA The term big data was defined as data sets of increasing volume, velocity and variety 3V; Big data sizes are ranging from a few hundreds terabytes to many petabytes of data in a single data set; Requires high computing power and large storage devices.

BIG DATA AN OFFICIAL STATISTICS Official statistics need to harness the potential of big data to derive more relevant and timely statistics; Large data sources that can be used in official statistics are: Administrative data; Commercial or transactional data; Data provided by sensors; Data provided by tracking devices; Behavioral data (for example Internet searches); Data provided by social media.

BIG DATA AND OFFICIAL STATISTICS Challenges: legislative issues; maintaining the privacy of the data; financial problems regarding the cost of sourcing data; data quality and suitability of statistical methods; technological challenges In this paper we will investigate a technological problem: integrating R and Hadoop

R Is a free software package for statistics and data visualization; Is available for UNIX, Windows and MacOS; R is used as a computational platform for regular statistics production in many official statistics agencies; It is used in many other sectors like finance, retail, manufacturing etc.

HADOOP Is a free software framework developed for distributed processing of large data sets using clusters of commodity hardware; It was developed in Java; Other languages could be used to: R, Python or Ruby; Available at http://hadoop.apache.org/.

HADOOP The Hadoop framework includes: Hadoop Distributed File System (HDFS); Hadoop YARN -a framework for job scheduling and cluster resource management; Hadoop MapReduce a system for parallel processing of large data sets;

HADOOP The main features of the Hadoop framework can be summarized as follows: High degree of scalability; Cost effective: it allows for massively parallel computing using commodity hardware; Flexibility: is able to use any type of data, structured or not; Fault tolerance.

R AND HADOOP R and Streaming; Rhipe; Rhadoop;

R AND STREAMING Allows users to run Map/Reduce jobs with any script or executable that can access standard input/standard output; No client-side integration with R;

R AND STREAMING AN EXAMPLE $ ${HADOOP_HOME}/bin/Hadoop jar ${HADOOP_HOME}/contrib/streaming/*.jar \ -inputformat org.apache.hadoop.mapred.textinputformat \ -input input_data.txt \ -output output \ -mapper /home/tst/src/map.r \ -reducer /home/tst/src/reduce.r \ -file /home/tst/src/map.r \ -file /home/tst/src/reduce.r

R AND HADOOP The integration of R and Hadoop using Streaming is an easy task; Requires that R should be installed on every DataNode of the Hadoop cluster ;

RHIPE Rhipe = R and Hadoop Integrated Programming Environment ; Provides a tight integration between R and Hadoop; Allows the user to carry out data analysis of big data directly in R; Available at www.datadr.org.

RHIPE Rhipe is an R library which allows running a MapReduce job within R; Install requirements: R on each Data Node; Protocol Buffers on each Data Node; Rhipe on each Data Node;

RHIPE AN EXAMPLE library(rhipe) rhinit(true, TRUE); map<-expression ( {lapply (map.values, function(mapper) )}) reduce<-expression( pre = { }, reduce = { }, post = { }, ) x <- rhmr(map=map, reduce=reduce, ifolder=inputpath, ofolder=outputpath, inout=c('text', 'text'), jobname='a job name')) rhex(z)

RHADOOP RHadoop is an open source project developed by Revolution Analytics; allows running a MapReduce jobs within R just like Rhipe; Consists in: plyrmr -providing common data manipulation operations on very large data sets managed by Hadoop; rmr a collection of functions providing and integration of R and MapReduce; rdfs an interface between R and HDFS; rhbase - an interface between R and HBase;

RHADOOP AN EXAMPLE library(rmr) map<-function(k,v) { } reduce<-function(k,vv) { } mapreduce( input = data.txt, output= output, textinputformat =rawtextinputformat, map = map, reduce=reduce )

CONCLUSIONS Each of the approaches has benefits and limitations: R with Streaming raises no problems regarding installation; Rhipe and RHadoop requires some effort to set up the cluster; The integration with R from the client side is high for Rhipe and Rhadoop and is missing for R and Streaming. Rhipe and RHadoop allows users to define and call their own map and reduce functions within R; Streaming uses a command line approach where the map and reduce functions are passed as arguments. There are other alternatives for large scale data analysis: Apache Mahout, Apache Hive, commercial versions of R provided by Revolution Analytics, Segue framework or ORCH;