Package hive. July 3, 2015




Version 0.2-0
Date 2015-07-02
Title Hadoop InteractiVE
Description Hadoop InteractiVE facilitates distributed computing via the MapReduce paradigm through R and Hadoop. An easy-to-use interface to Hadoop, the Hadoop Distributed File System (HDFS), and Hadoop Streaming is provided.
License GPL-3
Imports methods, rJava (>= 0.9-3), tools, XML
Depends R (>= 2.9.0)
Enhances HadoopStreaming
SystemRequirements Apache Hadoop >= 2.6.0 (https://hadoop.apache.org/releases.html#download); obsolete: Hadoop core >= 0.19.1 and <= 1.0.3 or CDH3 (http://www.cloudera.com); standard unix tools (e.g., chmod)
NeedsCompilation no
Author Ingo Feinerer [aut], Stefan Theussl [aut, cre]
Maintainer Stefan Theussl <Stefan.Theussl@R-project.org>
Repository CRAN
Date/Publication 2015-07-03 15:16:19

R topics documented:

configuration
DFS
hive
hive_stream
Index
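
A minimal session might look as follows. This is a sketch only: it assumes Hadoop is installed locally and reachable via the HADOOP_HOME environment variable (or /etc/hadoop), as described in the hive help topic below.

## Sketch: install, load, and initialize hive against a local Hadoop installation.
## Assumes HADOOP_HOME (or /etc/hadoop) points to a working Hadoop setup.
## Not run:
install.packages("hive")
library("hive")
h <- .hinit()            # read the local Hadoop configuration
hive(h)                  # register it as the active cluster object
hive_start()             # start the framework (skip for system-wide installations)
hive_is_available()      # TRUE once the cluster is running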

configuration                    Managing the Hadoop configuration

Description

Functions for showing/changing Hadoop configuration.

Usage

hive_get_parameter( x, henv = hive() )
hive_get_masters( henv = hive() )
hive_get_slaves( henv = hive() )
hive_get_nreducer( henv = hive() )
hive_set_nreducer( n, henv = hive() )

Arguments

henv    An object containing the local Hadoop configuration.
x       A character string naming the parameter in the Hadoop configuration.
n       An integer specifying the number of reducers to be used in hive_stream().

Details

The function hive_get_parameter() is used to get parameters from the Hadoop cluster configuration.

The functions hive_get_slaves() and hive_get_masters() return the hostnames of the configured nodes in the cluster.

The functions hive_get_nreducer() and hive_set_nreducer() are used to get/set the number of reducers which are used in Hadoop Streaming using hive_stream().

Value

hive_get_parameter() returns the specified parameter as a character string.

hive_get_slaves() returns a character vector naming the hostnames of the configured worker nodes in the cluster.

hive_get_masters() returns a character vector of the hostnames of the configured master nodes in the cluster.

hive_get_nreducer() returns an integer representing the number of configured reducers.

Author(s)

Stefan Theussl

References

Apache Hadoop cluster configuration (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/clustersetup.html#configuring_hadoop_in_non-secure_mode).

Examples

## Which tmp directory is set in the Hadoop configuration?
## Not run: hive_get_parameter("hadoop.tmp.dir")

## The master nodes of the cluster
## Not run: hive_get_masters()

## The worker nodes of the cluster
## Not run: hive_get_slaves()

## The number of configured reducers
## Not run: hive_get_nreducer()


DFS                    Hadoop Distributed File System

Description

Functions providing high-level access to the Hadoop Distributed File System (HDFS).

Usage

DFS_cat( file, con = stdout(), henv = hive() )
DFS_delete( file, recursive = FALSE, henv = hive() )
DFS_dir_create( path, henv = hive() )
DFS_dir_exists( path, henv = hive() )
DFS_dir_remove( path, recursive = TRUE, henv = hive() )
DFS_file_exists( file, henv = hive() )
DFS_get_object( file, henv = hive() )
DFS_read_lines( file, n = -1L, henv = hive() )
DFS_rename( from, to, henv = hive() )
DFS_list( path = ".", henv = hive() )
DFS_tail( file, n = 6L, size = 1024L, henv = hive() )
DFS_put( files, path = ".", henv = hive() )
DFS_put_object( obj, file, henv = hive() )
DFS_write_lines( text, file, henv = hive() )

Arguments

henv        An object containing the local Hadoop configuration.
file        a character string representing a file on the DFS.
files       a character string representing files located on the local file system to be copied to the DFS.
n           an integer specifying the number of lines to read.
obj         an R object to be serialized to/from the DFS.
path        a character string representing a full path name in the DFS (without the leading hdfs://); for many functions the default corresponds to the user's home directory in the DFS.
recursive   logical. Should elements of the path other than the last be deleted recursively?
size        an integer specifying the number of bytes to be read. Must be sufficiently large, otherwise n does not have the desired effect.
text        a (vector of) character string(s) to be written to the DFS.
con         A connection to be used for printing the output provided by cat. Default: the standard output connection; currently has no other effect.
from        a character string representing a file or directory on the DFS to be renamed.
to          a character string representing the new filename on the DFS.

Details

The Hadoop Distributed File System (HDFS) is typically part of a Hadoop cluster or can be used as a stand-alone general-purpose distributed file system (DFS). Several high-level functions provide easy access to distributed storage.

DFS_cat is useful for producing output in user-defined functions. It reads from files on the DFS and typically prints the output to the standard output. Its behaviour is similar to the base function cat.

DFS_dir_create creates directories with the given path names if they do not already exist. Its behaviour is similar to the base function dir.create.

DFS_dir_exists and DFS_file_exists return a logical vector indicating whether the directory or file, respectively, named by its argument exists. See also function file.exists.

DFS_dir_remove attempts to remove the directory named in its argument and, if recursive is set to TRUE, also attempts to remove subdirectories in a recursive manner.

DFS_list produces a character vector of the names of files in the directory named by its argument.

DFS_read_lines is a reader for (plain text) files stored on the DFS. It returns a vector of character strings representing lines in the (text) file. If n is given as an argument, it reads that many lines from the given file. Its behaviour is similar to the base function readLines.

DFS_put copies files named by its argument to a given path in the DFS.

DFS_put_object serializes an R object to the DFS.

DFS_write_lines writes a given vector of character strings to a file stored on the DFS. Its behaviour is similar to the base function writeLines.

Value

DFS_delete(), DFS_dir_create(), and DFS_dir_remove() return a logical value indicating whether the operation succeeded for the given argument.

DFS_dir_exists() and DFS_file_exists() return TRUE if the named directories or files exist in the HDFS.

DFS_get_object() returns the deserialized object stored in a file on the HDFS.

DFS_list() returns a character vector representing the directory listing of the corresponding path on the HDFS.

DFS_read_lines() returns a character vector whose length is the number of lines read.

DFS_tail() returns a character vector whose length is the number of lines read from the end of a file on the HDFS.

Author(s)

Stefan Theussl

References

The Hadoop Distributed File System (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfsdesign.html).

Examples

## Do we have access to the root directory of the DFS?
## Not run: DFS_dir_exists("/")

## Some self-explanatory DFS interaction
## Not run:
DFS_list( "/" )
DFS_dir_create( "/tmp/test" )
DFS_write_lines( c("hello HDFS", "Bye Bye HDFS"), "/tmp/test/hdfs.txt" )
DFS_list( "/tmp/test" )
DFS_read_lines( "/tmp/test/hdfs.txt" )

## Serialize an R object to the HDFS
## Not run:
foo <- function() "You got me serialized."
sro <- "/tmp/test/foo.sro"
DFS_put_object(foo, sro)
DFS_get_object( sro )()

## Finally (recursively) remove the created directory
## Not run: DFS_dir_remove( "/tmp/test" )
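
The remaining helpers documented above (DFS_file_exists, DFS_tail, DFS_cat, DFS_rename, DFS_delete) follow the same pattern. A short sketch, assuming a running HDFS and that the file written in the examples above is still present (i.e., before the final cleanup step):

## Sketch of the remaining DFS helpers; "/tmp/test/hdfs.txt" is the file
## written in the examples above, and a running HDFS is assumed.
## Not run:
DFS_file_exists( "/tmp/test/hdfs.txt" )                      # TRUE if the file is present
DFS_tail( "/tmp/test/hdfs.txt", n = 1L )                     # last line of the file
DFS_cat( "/tmp/test/hdfs.txt" )                              # print the contents to stdout
DFS_rename( "/tmp/test/hdfs.txt", "/tmp/test/renamed.txt" )  # rename within the DFS
DFS_delete( "/tmp/test/renamed.txt" )                        # remove the single file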

hive                    Hadoop Interactive Framework Control

Description

High-level functions to control the Hadoop framework.

Usage

hive( new )
.hinit( hadoop_home )
hive_start( henv = hive() )
hive_stop( henv = hive() )
hive_is_available( henv = hive() )

Arguments

hadoop_home    A character string pointing to the local Hadoop installation. If not given, then .hinit() will search the default installation directory (given via the environment variable HADOOP_HOME, or /etc/hadoop, respectively).
henv           An object containing the local Hadoop configuration.
new            An object specifying the Hadoop environment.

Details

The function hive() is used to get/set the Hadoop cluster object. This object consists of an environment holding information about the Hadoop cluster.

The function .hinit() is used to initialize a Hadoop cluster. It retrieves most configuration options by searching the HADOOP_HOME directory given as an environment variable or, alternatively, by searching the /etc/hadoop directory in case Cloudera's distribution (http://www.cloudera.com, i.e., CDH3) is used.

The functions hive_start() and hive_stop() are used to start/stop the Hadoop framework. The latter is not applicable for system-wide installations like CDH3.

The function hive_is_available() is used to check the status of a Hadoop cluster.

Value

hive() returns an object of class "hive" representing the currently used cluster configuration.

hive_is_available() returns TRUE if the given Hadoop framework is running.

Author(s)

Stefan Theussl

References

Apache Hadoop core: http://hadoop.apache.org/core/.

Cloudera's distribution including Apache Hadoop (CDH): http://www.cloudera.com/.

Examples

## Read the configuration and initialize a Hadoop cluster:
## Not run: h <- .hinit( "/etc/hadoop" )
## Not run: hive( h )

## Start the Hadoop cluster:
## Not run: hive_start()

## Check the status of the Hadoop cluster:
## Not run: hive_is_available()

## Return the cluster configuration h:
hive()

## Stop the Hadoop cluster:
## Not run: hive_stop()


hive_stream                    Hadoop Streaming with package hive

Description

High-level R function for using Hadoop Streaming.

Usage

hive_stream( mapper, reducer, input, output, henv = hive(),
             mapper_args = NULL, reducer_args = NULL, cmdenv_arg = NULL,
             streaming_args = NULL )

Arguments

mapper           a function which is executed on each worker node. The so-called mapper typically maps input key/value pairs to a set of intermediate key/value pairs.
reducer          a function which is executed on each worker node. The so-called reducer reduces a set of intermediate values which share a key to a smaller set of values. If no reducer is used, leave empty.
input            specifies the directory holding the data in the DFS.
output           specifies the output directory in the DFS containing the results after the streaming job has finished.
henv             Hadoop local environment.
mapper_args      additional arguments to the mapper.
reducer_args     additional arguments to the reducer.
cmdenv_arg       additional arguments passed as environment variables to distributed tasks.
streaming_args   additional arguments passed to the Hadoop Streaming utility. By default, only the number of reducers will be set using "-D mapred.reduce.tasks=".

Details

The function hive_stream() starts a MapReduce job on the given data located on the HDFS.

Author(s)

Stefan Theussl

References

Apache Hadoop Streaming (https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html).

Examples

## A simple word count example.
## Put some xml files on the HDFS:
## Not run: DFS_put( system.file("defaults/core/", package = "hive"), "/tmp/input" )
## Not run: DFS_put( system.file("defaults/hdfs/hdfs-default.xml", package = "hive"), "/tmp/input" )
## Not run: DFS_put( system.file("defaults/mapred/mapred-default.xml", package = "hive"), "/tmp/input" )

## Define the mapper and reducer functions to be applied.
## Note that a Hadoop map or reduce job retrieves data line by line from stdin.
## Not run:
mapper <- function(x){
    con <- file( "stdin", open = "r" )
    while (length(line <- readLines(con, n = 1L, warn = FALSE)) > 0) {
        terms <- unlist(strsplit(line, " "))
        terms <- terms[nchar(terms) > 1]
        if( length(terms) )
            cat( paste(terms, 1, sep = "\t"), sep = "\n" )
    }
}
reducer <- function(x){
    env <- new.env( hash = TRUE )
    con <- file( "stdin", open = "r" )
    while (length(line <- readLines(con, n = 1L, warn = FALSE)) > 0) {
        keyvalue <- unlist( strsplit(line, "\t") )
        if( exists(keyvalue[1], envir = env, inherits = FALSE) ){
            assign( keyvalue[1], get(keyvalue[1], envir = env) + as.integer(keyvalue[2]), envir = env )
        } else {
            assign( keyvalue[1], as.integer(keyvalue[2]), envir = env )
        }
    }
    env <- as.list(env)
    for( term in names(env) )
        writeLines( paste(c(term, env[[term]]), collapse = "\t") )
}
hive_set_nreducer(1)
hive_stream( mapper = mapper, reducer = reducer, input = "/tmp/input", output = "/tmp/output" )
DFS_list("/tmp/output")
head( DFS_read_lines("/tmp/output/part-00000") )

## Don't forget to clean the file system:
## Not run: DFS_dir_remove("/tmp/input")
## Not run: DFS_dir_remove("/tmp/output")
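
For jobs that need extra Streaming options or environment variables on the task nodes, the cmdenv_arg and streaming_args arguments documented above can be supplied. A sketch, reusing the mapper and reducer defined above and assuming the word-count input directory is still on the HDFS; the particular option strings are illustrative assumptions, not values prescribed by the package:

## Sketch: re-run the word count with an environment variable for the tasks
## and an explicit Streaming option. The option strings are illustrative.
## Not run:
hive_stream( mapper = mapper, reducer = reducer,
             input = "/tmp/input", output = "/tmp/output2",
             cmdenv_arg = "LANG=en_US.UTF-8",
             streaming_args = "-D mapred.reduce.tasks=1" )
DFS_list("/tmp/output2")
## Not run: DFS_dir_remove("/tmp/output2")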

Index

.hinit (hive)
configuration
DFS
DFS_cat (DFS)
DFS_delete (DFS)
DFS_dir_create (DFS)
DFS_dir_exists (DFS)
DFS_dir_remove (DFS)
DFS_file_exists (DFS)
DFS_get_object (DFS)
DFS_list (DFS)
DFS_put (DFS)
DFS_put_object (DFS)
DFS_read_lines (DFS)
DFS_rename (DFS)
DFS_tail (DFS)
DFS_write_lines (DFS)
hive
hive_create (hive)
hive_get_masters (configuration)
hive_get_nreducer (configuration)
hive_get_parameter (configuration)
hive_get_slaves (configuration)
hive_is_available (hive)
hive_set_nreducer (configuration)
hive_start (hive)
hive_stop (hive)
hive_stream