Package HadoopStreaming



February 19, 2015

Type: Package
Title: Utilities for using R scripts in Hadoop streaming
Version: 0.2
Date: 2009-09-28
Author: David S. Rosenberg <drosen@sensenetworks.com>
Maintainer: David S. Rosenberg <drosen@sensenetworks.com>
Depends: getopt
Description: Provides a framework for writing map/reduce scripts for use in
  Hadoop Streaming. Also facilitates operating on data in a streaming
  fashion, without Hadoop.
License: GPL
Repository: CRAN
Date/Publication: 2012-10-29 08:57:09
NeedsCompilation: no

R topics documented:
  HadoopStreaming-package
  hsCmdLineArgs
  hsKeyValReader
  hsLineReader
  hsTableReader
  hsWriteTable
  Index

HadoopStreaming-package    Functions facilitating Hadoop streaming with R

Description

Provides a framework for writing map/reduce scripts for use in Hadoop
Streaming. Also facilitates operating on data in a streaming fashion,
without Hadoop.

Package: HadoopStreaming
Type: Package
Version: 0.2
Date: 2009-09-28
License: GPL
LazyLoad: yes

Details

The functions in this package read data in chunks from a file connection
(stdin when used with Hadoop streaming), package up the chunks in various
ways, and pass the packaged versions to user-supplied functions. There are
three functions for reading data:

1. hsTableReader reads data in table format (i.e. columns separated by a
   separator character).
2. hsKeyValReader reads key/value pairs, where each is a string.
3. hsLineReader reads entire lines as strings, without any data parsing.

Only hsTableReader will break the data into chunks comprising all rows of
the same key. This assumes that all rows with the same key are stored
consecutively in the input file. That is always the case if the input file
is taken to be the stdin provided by Hadoop in a Hadoop streaming job,
since Hadoop guarantees that the rows given to the reducer are sorted by
key. When running from the command line (not in Hadoop), we can use the
sort utility to sort the keys ourselves.

In addition to the data-reading functions, hsCmdLineArgs offers several
default command line arguments for doing things such as specifying an
input file, the number of lines of input to read, and the input and output
column separators. hsCmdLineArgs also facilitates packaging both the
mapper and reducer scripts into a single R script by accepting arguments
mapper and reducer to specify whether the call to the script should
execute the mapper branch or the reducer branch.

The examples below give a bit of support code for using the functions in
this package. Details on using the functions themselves can be found in
the documentation for those functions. For a full demo of running a
map/reduce script from the command line and in Hadoop, see the directory
<RLibraryPath>/HadoopStreaming/wordCntDemo/ and the README file there.
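The key-grouping property mentioned above can be tried without Hadoop or R at all. The following sketch uses only standard Unix tools (the filenames and values are made up for illustration) to show that a plain sort makes all rows for a key consecutive, which is exactly the guarantee a reducer-side reader depends on:

```shell
# Unsorted mapper-style output: the rows for key "b" are not consecutive.
# After `sort -k1,1` (sort by the first field), they are, so a reducer can
# process one key's rows as a single group.
printf 'b\t1\na\t2\nb\t3\n' | sort -k1,1
```

This is why, when testing on the command line, a `sort` stage must sit between the mapper and the reducer.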

Author(s)

David S. Rosenberg <drosen@sensenetworks.com>

Examples

## STEP 1: MAKE A CONNECTION

## To read from stdin (used for deployment in Hadoop streaming and for
## command line testing)
con = file(description="stdin", open="r")

## Reading from a text string: useful for very small test examples
str <- "Key1\tVal1\nKey2\tVal2\nKey3\tVal3\n"
cat(str)

## Reading from a file: useful for testing purposes during development
cat(str, file="datafile.txt")   # write the data in str to datafile.txt
con <- file("datafile.txt", open="r")

## To get the first few lines of a file (also very useful for testing)
numlines = 2
con <- pipe(paste("head -n", numlines, "datafile.txt"), "r")

## STEP 2: Write map and reduce scripts, call them mapper.r and
## reducer.r. Alternatively, write a single script taking command line
## flags specifying whether it should run as a mapper or reducer. The
## hsCmdLineArgs function can assist with this.
## Writing #!/usr/bin/env Rscript as the first line can make an R file
## executable from the command line.

## STEP 3a: Running on the command line with separate mappers and reducers
## cat inputfile | Rscript mapper.r | sort | Rscript reducer.r
## OR
## cat inputfile | R --vanilla --slave -f mapper.r | sort | R --vanilla --slave -f reducer.r

## STEP 3b: Running on the command line with the recommended single-file
## approach, using Rscript and hsCmdLineArgs for argument parsing
## cat inputfile | ./mapReduce.r --mapper | sort | ./mapReduce.r --reducer

## STEP 3c: Running in Hadoop -- assuming mapper.r and reducer.r can
## run on each computer in the cluster:
## $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
##   -input inpath -output outpath \
##   -mapper "R --vanilla --slave -f mapper.r" \
##   -reducer "R --vanilla --slave -f reducer.r" \
##   -file ./mapper.r -file ./reducer.r

## STEP 3d: Running in Hadoop with the recommended single-file method:
## $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
##   -input inpath -output outpath \
##   -mapper "mapReduce.r --mapper" -reducer "mapReduce.r --reducer" \
##   -file ./mapReduce.r
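The shape of the STEP 3a pipeline can be exercised with standard tools standing in for the R scripts (the input text and field names here are illustrative, not from the package). `tr` plays the role of mapper.r, emitting one key per line; `sort` groups the keys; `awk` plays the role of reducer.r, counting per key:

```shell
# Word count in the Hadoop-streaming style: mapper | sort | reducer.
echo "the quick the lazy the" |
  tr ' ' '\n' |                                           # "mapper": one word per line
  sort |                                                  # group identical keys
  awk '{n[$1]++} END {for (w in n) print w "\t" n[w]}' |  # "reducer": count per key
  sort                                                    # deterministic display order
```

Replacing `tr` and `awk` with `Rscript mapper.r` and `Rscript reducer.r` gives the STEP 3a invocation above.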

hsCmdLineArgs    Handles command line arguments for Hadoop streaming tasks

Description

Offers several command line arguments useful for Hadoop streaming. Allows
specifying input and output files, column separators, and much more.
Optionally opens the I/O connections.

Usage

hsCmdLineArgs(spec=c(), openConnections=TRUE, args=commandArgs(TRUE))

Arguments

spec             A vector specifying the command line arguments to support.
openConnections  A boolean specifying whether to open the I/O connections.
args             Character vector of arguments. Defaults to the command
                 line arguments.

Details

The spec vector has length 6*n, where n is the number of command line
arguments specified. The spec has the same format as the spec parameter in
the getopt function of the getopt package, though we have one additional
entry specifying a default value. The six entries per argument are the
following:

1. long flag name (a multi-character string)
2. short flag name (a single character)
3. argument specification: 0=no arg, 1=required arg, 2=optional arg
4. data type ("logical", "integer", "double", "complex", or "character")
5. a string describing the option
6. the default value to be assigned to this parameter

See getopt in the getopt package for details.

The following vector defines the default command line arguments. The
vector is appended to the user-supplied spec vector in the call to getopt.

basespec = c(
  'mapper',     'm', 0, "logical",   "Runs the mapper.",                                   FALSE,
  'reducer',    'r', 0, "logical",   "Runs the reducer, unless already running mapper.",   FALSE,
  'mapcols',    'a', 0, "logical",   "Prints column headers for mapper output.",           FALSE,
  'reducecols', 'b', 0, "logical",   "Prints column headers for reducer output.",          FALSE,
  'infile',     'i', 1, "character", "Specifies an input file, otherwise use stdin.",      NA,
  'outfile',    'o', 1, "character", "Specifies an output file, otherwise use stdout.",    NA,
  'skip',       's', 1, "numeric",   "Number of lines of input to skip at the beginning.", 0,
  'chunksize',  'C', 1, "numeric",   "Number of lines to read at once, a la scan.",        -1,
  'numlines',   'n', 1, "numeric",   "Max num lines to read per mapper or reducer job.",   0,
  'sepr',       'e', 1, "character", "Separator character, as used by scan.",              '\t',
  'insep',      'f', 1, "character", "Separator character for input, defaults to sepr.",   NA,
  'outsep',     'g', 1, "character", "Separator character for output, defaults to sepr.",  NA,
  'help',       'h', 0, "logical",   "Get a help message.",                                FALSE
)

Value

Returns a list. The names of the entries in the list are the long flag
names. Their values are either those specified on the command line or the
default values. If openConnections=TRUE, then the returned list has two
additional entries: incon and outcon. incon is a readable connection to
the input source specified, and outcon is a writable connection to the
appropriate output destination.

An additional entry in the returned list is named set. When this list
entry is FALSE, none of the options were set (generally because -h or
--help was requested). The calling procedure should probably stop
execution when set is returned as FALSE.

Author(s)

David S. Rosenberg <drosen@sensenetworks.com>

See Also

This package relies heavily on the package getopt.

Examples

spec = c('mychunksize', 'C', 1, "numeric", "Number of lines to read at once, a la scan.", -1)
## Displays the help string
hsCmdLineArgs(spec, args=c('-h'))
## Call with the mapper flag, and request that connections be opened
opts = hsCmdLineArgs(spec, openConnections=TRUE, args=c('-m'))
opts         # a list of argument values
opts$incon   # an input connection
opts$outcon  # an output connection

hsKeyValReader    Reads key value pairs

Description

Uses scan to read in chunkSize lines at a time, where each line consists
of a key string and a value string. The first skip lines of input are
skipped. Each group of key/value pairs is passed to FUN as a character
vector of keys and a character vector of values.

Usage

hsKeyValReader(file = "", chunkSize = -1, skip = 0, sep = "\t",
               FUN = function(k, v) cat(paste(k, v, sep=": "), sep="\n"))

Arguments

file       A connection object or a character string, as in scan.
chunkSize  The (maximal) number of lines to read at a time. The default
           is -1, which specifies that the whole file should be read at
           once.
skip       Number of lines to ignore at the beginning of the file.
FUN        A function that takes two character vectors (keys and values)
           as input.
sep        The character separating the key and the value strings.

Value

No return value.

Author(s)

David S. Rosenberg <drosen@sensenetworks.com>

Examples

printFn <- function(k, v) {
  cat('A chunk:\n')
  cat(paste(k, v, sep=': '), sep='\n')
}
str <- "key1\tval1\nkey2\tval2\nkey3\tval3\n"
cat(str)
con <- textConnection(str, open="r")
hsKeyValReader(con, chunkSize=2, FUN=printFn)
close(con)
con <- textConnection(str, open="r")
hsKeyValReader(con, FUN=printFn)
close(con)

hsLineReader    A wrapper for readLines

Description

This function repeatedly reads chunkSize lines of data from file and
passes a character vector of these strings to FUN. The first skip lines
of input are ignored.

Usage

hsLineReader(file = "", chunkSize = -1, skip = 0,
             FUN = function(x) cat(x, sep = "\n"))

Arguments

file       A connection object or a character string, as in readLines.
chunkSize  The (maximal) number of lines to read at a time. The default
           is -1, which specifies that the whole file should be read at
           once.
skip       Number of lines to ignore at the beginning of the file.
FUN        A function that takes a character vector as input.

Details

Warning: a feature(?) of readLines is that if there is a newline before
the EOF, an extra empty string is returned.

Value

No return value.

Author(s)

David S. Rosenberg <drosen@sensenetworks.com>

Examples

str <- "Hello here are some\nlines of text\nto read in, chunkSize\nlines at a time.\nHow interesting.\nhuh?"
cat(str)
con <- textConnection(str, open="r")
hsLineReader(con, chunkSize=-1, FUN=print)
close(con)
con <- textConnection(str, open="r")
hsLineReader(con, chunkSize=3, skip=1, FUN=print)
close(con)

hsTableReader    Chunks input data into data frames

Description

This function repeatedly reads chunks of data from an input connection,
packages the data as a data.frame, optionally ensures that all the rows
for certain keys are contained in the data.frame, and passes the
data.frame to a handler for processing. This continues until the end of
file.

Usage

hsTableReader(file="", cols='character', chunkSize=-1, FUN=print,
              ignoreKey=TRUE, singleKey=TRUE, skip=0, sep='\t',
              keyCol='key', PFUN=NULL, carryMemLimit=512e6,
              carryMaxRows=Inf, stringsAsFactors=FALSE, debug=FALSE)

Arguments

file             Any file specification accepted by scan.
cols             A list of column names, as accepted by the what arg to
                 scan.
chunkSize        Number of lines to read at a time.
FUN              A function accepting a data frame with columns given by
                 cols.
ignoreKey        If TRUE, always passes chunkSize rows to FUN, regardless
                 of whether the chunk has only some of the rows for a
                 given key. If TRUE, the singleKey arg is ignored.
singleKey        If TRUE, then each data frame passed to FUN will contain
                 all rows corresponding to a single key. If FALSE, it may
                 contain rows for several complete keys.
skip             Number of lines to skip at the beginning of the file.
sep              Any separator character accepted by scan.
keyCol           The column name of the column with the keys.
PFUN             Same as FUN, except handles incomplete keys. See below.
carryMemLimit    Max memory used for the values of a single key.
carryMaxRows     Max number of values allowed for a key.
stringsAsFactors Whether strings are converted to factors.
debug            Whether to print debug messages.

Details

With ignoreKey=TRUE, hsTableReader reads from file, chunkSize lines at a
time, packages the lines into a data.frame, and passes the data.frame to
FUN. This mode would most commonly be used for the MAPPER job in Hadoop.

Everything below pertains to ignoreKey=FALSE. With ignoreKey=FALSE,
hsTableReader breaks up the data read from file into chunks comprising
all rows of the same key. This ASSUMES that all rows with the same key
are stored consecutively in the input file. (This is always the case if
the input file is taken to be the STDIN pipe to a Hadoop reducer.)

We first discuss the case of PFUN=NULL. This is the recommended setting
when all the values for a given key can comfortably fit in memory. When
singleKey=TRUE, FUN is called with all the rows for a single key at a
time. When singleKey=FALSE, FUN may be called with rows corresponding to
multiple keys, but we guarantee that the data.frame contains all the rows
corresponding to any key that appears in the data.frame.

When PFUN != NULL, hsTableReader does not wait to collect all rows for a
given key before passing the data to FUN. When hsTableReader reads the
first chunk of rows from file, it first passes all the rows of complete
keys to FUN. Then, it passes the rows corresponding to the last key, call
it PREVKEY, to PFUN. These rows may or may not consist of all rows
corresponding to PREVKEY. Then, hsTableReader continues to read chunks
from file and pass them to PFUN until it reaches a new key (i.e.
different from PREVKEY). At this point, PFUN is called with an empty
data.frame to indicate that it is done handling (key,value) pairs for
PREVKEY. Then, as with the first chunk,

any complete keys left in the chunk are passed to FUN and the incomplete
key is passed to PFUN. The process continues until the end of file. By
using a PFUN function, we can process more values for a given key than
can fit into memory. See below for examples of using PFUN.

Value

No return value.

Author(s)

David S. Rosenberg <drosen@sensenetworks.com>

Examples

## This function is useful as a reader for Hadoop reduce scripts
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\t10.1\nkey3\t1.0\n"
cat(str)
cols = list(key='', val=0)
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=6, FUN=print, ignoreKey=TRUE)
close(con)
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=6, FUN=print, ignoreKey=FALSE, singleKey=TRUE)
close(con)
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=6, FUN=print, ignoreKey=FALSE, singleKey=FALSE)
close(con)

## The next two examples compute the mean, by key, in two different ways
reducerKeyAtOnce <- function(d) {
  key = d[1, 'key']
  ave = mean(d[, 'val'])
  cat(key, ave, '\n', sep='\t')
}
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=6, FUN=reducerKeyAtOnce, ignoreKey=FALSE, singleKey=TRUE)
close(con)

reducerManyCompleteKeys <- function(d) {
  a = aggregate(d$val, by=list(d$key), FUN=mean)
  write.table(a, quote=FALSE, sep='\t', row.names=FALSE, col.names=FALSE)
}
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=6, FUN=reducerManyCompleteKeys, ignoreKey=FALSE, singleKey=FALSE)
close(con)

### ADVANCED: When we have more values for each key than can fit in memory

## Test example to see how the input is broken up
reducerFullKeys <- function(d) {

  print("processing complete keys.")
  print(d)
}
reducerPartialKey <- function(d) {
  if (nrow(d) == 0) {
    print("done with partial key")
  } else {
    print("processing partial key...")
    print(d)
  }
}
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=5, FUN=reducerFullKeys, ignoreKey=FALSE,
              singleKey=FALSE, PFUN=reducerPartialKey)
close(con)

## Repeats the mean example, with partial key processing
partialSum = 0
partialCnt = 0
partialKey = NA
reducerPartialKey <- function(d) {
  if (nrow(d) == 0) {
    ## An empty data.frame indicates we have seen all rows for the previous key
    ave = partialSum / partialCnt
    cat(partialKey, ave, '\n', sep='\t')
    partialSum <<- 0
    partialCnt <<- 0
    partialKey <<- NA
  } else {
    if (is.na(partialKey)) partialKey <<- d[1, 'key']
    partialSum <<- partialSum + sum(d[, 'val'])
    partialCnt <<- partialCnt + nrow(d)
  }
}
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=6, FUN=reducerKeyAtOnce, ignoreKey=FALSE,
              singleKey=TRUE, PFUN=reducerPartialKey)
close(con)
con <- textConnection(str, open="r")
hsTableReader(con, cols, chunkSize=6, FUN=reducerManyCompleteKeys, ignoreKey=FALSE,
              singleKey=FALSE, PFUN=reducerPartialKey)
close(con)

hsWriteTable    Calls write.table with defaults sensible for Hadoop streaming

Description

Calls write.table without row names or column names, without string
quotes, and with tab as the default separator.

Usage

hsWriteTable(d, file = "", sep = "\t")

Arguments

d     A data frame.
file  A connection, as taken by write.table().
sep   The column separator, defaults to a tab character.

Value

No return value.

Author(s)

David S. Rosenberg <drosen@sensenetworks.com>

See Also

write.table

Examples

d = data.frame(a=c('hi', 'yes', 'no'), b=c(1, 2, 3))
rownames(d) <- c('row1', 'row2', 'row3')
write.table(d)    # default output: quoted strings, row and column names
hsWriteTable(d)   # tab-separated, no quotes, no row or column names
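These defaults matter because Hadoop streaming treats everything before the first tab of an output line as the key, so records must be bare, unquoted, and tab-separated. A small sketch with standard tools (the example records are made up) showing such hsWriteTable-style output being consumed field-wise downstream:

```shell
# Bare tab-separated records, as hsWriteTable emits them, can be consumed
# directly by field-oriented Unix tools and by Hadoop's key extraction.
printf 'hi\t1\nyes\t2\nno\t3\n' |
  cut -f1          # extract the key field (everything before the first tab)
```

With write.table's defaults (quoted strings, row names, a header line), the key field would be polluted and per-key grouping in the shuffle would break.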

Index

Topic: package
  HadoopStreaming-package

HadoopStreaming (HadoopStreaming-package)
HadoopStreaming-package
hsCmdLineArgs
hsKeyValReader
hsLineReader
hsTableReader
hsWriteTable
write.table