RHadoop and MapR: Accessing Enterprise-Grade Hadoop from R
Version 2.0 (14 March 2014)
Douglas Miller
Table of Contents

Introduction
Environment
R
Special Installation Notes
Install R
Install RHadoop
Install rhdfs
Install rmr2
Install rhbase
Conclusion
Resources
Introduction

RHadoop is an open source collection of three R packages created by Revolution Analytics that allow users to manage and analyze data with Hadoop from an R environment. It allows data scientists familiar with R to quickly use the enterprise-grade capabilities of the MapR Hadoop distribution together with the analytic capabilities of R. This paper provides step-by-step instructions to install and use RHadoop with MapR and R on RedHat Enterprise Linux.

RHadoop consists of the following packages:

rhdfs - functions providing file management of the HDFS from within R
rmr2 - functions providing Hadoop MapReduce functionality in R
rhbase - functions providing database management for the HBase distributed database from within R

Each of the RHadoop packages can be installed and used independently or in conjunction with each other.

Environment

The integration testing described in this paper was performed in March 2014 on a 3-node Amazon EC2 cluster. The product versions in the test are listed in the table below. Note that Revolution Analytics currently provides Linux support only on RedHat.

Product                            Version
EC2 AMI                            RHEL-6.4 x86_64 (ami e31)
EC2 instance type                  m1.large
Root/Boot                          8GB EBS standard
MapR storage                       (3) 450GB EBS standard
RedHat Enterprise Linux (64-bit)   el6.x86_64
Java                               java openjdk.x86_64, java openjdk-devel.x86_64
MapR M7                            (GA-1)
HBase
GNU R
RHadoop rhdfs
RHadoop rmr2
RHadoop rhbase
Apache Thrift

R

R is both a language and an environment for statistical computation and is freely available from GNU. Revolution Analytics provides two versions of R: the free Revolution R Community and the premium Revolution R Enterprise for workstations and servers. Revolution R Community is an enhanced distribution of the open source R for users
looking for faster performance and greater stability. Revolution R Enterprise adds commercial enhancements and professional support for real-world use. It brings higher performance, greater scalability, and stronger reliability to R at a fraction of the cost of legacy products.

R is easily extended with libraries that are distributed in packages. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation. Once installed, they have to be loaded into the session to be used.

Special Installation Notes

These installation instructions are specific to the product versions specified earlier in this document. Some modifications may be required for your environment. A package repository must be available to install dependent packages. The MapR cluster must be installed and running either Hadoop HBase or MapR tables. You will need root privileges on all nodes in the cluster. MapR installation instructions are available on the MapR Documentation web site.

All commands entered by the user are in bold courier font. Commands entered from within the R environment are preceded by the default R prompt '>'. Linux shell commands are preceded by either the '#' character (for the root user) or the '$' character (for a non-root user). For example, the following represents running the yum command as the root user:

# yum install git -y

The following example represents running a command within the R shell:

> library(rhdfs)

Similar to the R shell, the following represents running a command within the HBase shell:

hbase(main):001:0> create 'mytable', 'cf1'

Note that some Linux commands are long and wrap across multiple lines in this document. Linux commands use the backslash "\" character to escape the carriage return.
Similarly, in the R shell, long commands use the comma "," character to escape the carriage return. This facilitates copying from this document and pasting into a Linux terminal window. Here is an example of a long Linux command that wraps to multiple lines in this document:

# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop /hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"

And here is an example of a long R command that wraps to multiple lines in this document:

> install.packages(c('Rcpp'),
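The backslash continuation behaves the same way in any POSIX shell, so wrapped commands can be pasted verbatim. A quick self-contained illustration (no Hadoop involved):

```shell
# A trailing backslash escapes the newline, so the shell parses
# these two physical lines as one logical command.
echo "word count" \
  "demo"
# prints: word count demo
```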
Unless otherwise indicated, all commands should be run as the root user on the client and task tracker systems. This document assumes there is a non-root user called user01 on your client system for purposes of validating the installation. You can use any non-root user that can access the MapR cluster in those commands; just remember to replace all occurrences of user01 with your non-root user. Finally, note that these installation instructions assume your client systems are running RedHat Linux, as that is the only operating system supported by Revolution Analytics.

Install R

Testing for this paper was done with GNU R. If you have Revolution R Community or Enterprise, you can use that version of R instead; follow Revolution Analytics installation instructions for the appropriate edition. A version of R must always be installed on the client system accessing the cluster using the RHadoop libraries. Additionally, to execute MapReduce jobs with the rmr2 library, R must be installed on all task tracker nodes in the cluster. As the root user, follow the installation steps below to install GNU R on all client systems and task trackers.

1) Install GNU R.

# yum -y --enablerepo=epel install R

2) Install the GNU R developer package.

# yum -y --enablerepo=epel install R-devel

Note that the R-devel package may already be up to date when installing the R package.

3) Confirm the installation was successful by running R as a non-root user on your client system and all your task trackers. At the command line, type the following command to determine the version of R that is installed.

# su - user01 -c "R --version"

Install RHadoop

The installation instructions that follow are complete for each RHadoop package (rhdfs, rmr2, rhbase). System administrators can skip to the installation instructions for just the package(s) they want to install. Recall that R must be installed before installing any of the RHadoop packages.
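Several validation steps in this guide chain `; echo $?` after a command to surface its exit status, where 0 means success. A self-contained illustration of the convention:

```shell
# The exit status of the most recent command is held in $?;
# the guide's validation steps print it with `; echo $?`.
true; echo $?     # prints 0: command succeeded
false; echo $?    # prints 1: command failed
```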
Install rhdfs

The rhdfs package uses the hadoop command to access MapR file services. To use rhdfs, R and the rhdfs package only need to be installed on the client system that is accessing the cluster. This can be a node in the cluster, or it can be any client system that can access the cluster with the hadoop command. As the root user, perform the following steps on every client node.

1) Confirm that you can access the MapR file services by listing the contents of the top-level directory.

# su - user01 -c "hadoop fs -ls /"

2) Install the rJava R package that is required by rhdfs.

# R --save
> install.packages(c('rJava'), repos="
> quit()

3) Download the rhdfs package from github.

# yum install git -y
# cd ~
# git clone git://github.com/revolutionanalytics/rhdfs.git

4) Set the HADOOP_CMD environment variable to the hadoop command script and install the rhdfs package. Whereas the rJava package in the previous step was downloaded and installed from a CRAN repository within the R session, rhdfs is installed from the Linux shell with R CMD INSTALL.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop /bin/hadoop
# R CMD INSTALL ~/rhdfs/pkg

5) Set required environment variables. In addition to the HADOOP_CMD environment variable set in the previous step, LD_LIBRARY_PATH must include the location of the MapR client library libmaprclient.so, and HADOOP_CONF must specify the Hadoop configuration directory. Any user wanting to use the rhdfs library must set these environment variables.

# export LD_LIBRARY_PATH=/opt/mapr/lib:$LD_LIBRARY_PATH
# export HADOOP_CONF=/opt/mapr/hadoop/hadoop /conf

6) Switch user to your non-root user to validate the installation was successful.

# su user01
$

7) From R, load the rhdfs library and confirm that you can access the MapR cluster file system by listing the root directory.

$ R --no-save
> library(rhdfs)
> hdfs.init()
> hdfs.ls('/')
> quit()

Note: When loading an R library using the library() command, dependent libraries will also be loaded. For rhdfs, the rJava library will be loaded if it has not already been loaded.

8) Exit from your su command.
$ exit

9) Check the installation of the rhdfs package.

# R CMD check ~/rhdfs/pkg; echo $?

An exit code of 0 means the installation was successful. If no errors are reported, you have successfully installed rhdfs and can use it to access the MapR file services from R (you can safely ignore any "notes").

10) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below will set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" \
>> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop /bin/hadoop" \
>> /etc/profile

Install rmr2

The rmr2 package uses Hadoop Streaming to invoke R map and reduce functions. To use rmr2, R and the rmr2 package must be installed on the clients as well as on every task tracker node in the MapR cluster. Install rmr2 on every task tracker node AND on every client system as the root user.

1) Install the required R packages:

# R --no-save
> install.packages(c('Rcpp'),
> install.packages(c('RJSONIO'),
> install.packages(c('itertools'),
> install.packages(c('digest'),
> install.packages(c('functional'),
> install.packages(c('stringr'),
> install.packages(c('plyr'),
> install.packages(c('bitops'),
> install.packages(c('reshape2'),
> install.packages(c('caTools'),
> quit()

Note: Warnings may be safely ignored while the packages are being built.

2) Download the quickcheck and rmr2 packages from github. RHadoop includes the quickcheck package to support writing randomized unit tests performed by the rmr2 package check.

# cd ~
# git clone git://github.com/revolutionanalytics/quickcheck.git
# git clone git://github.com/revolutionanalytics/rmr2.git

3) Install the quickcheck and rmr2 packages.

# R CMD INSTALL ~/quickcheck/pkg
# R CMD INSTALL ~/rmr2/pkg

Note: Warnings may be safely ignored while the packages are being built.

4) From any task tracker node, create a directory called /tmp (if it doesn't already exist) in the root of your MapR-FS (owned by the mapr user) and give it global read-write permissions. This directory is required for running MapReduce applications in R using the rmr2 package.

# su - mapr -c "hadoop fs -mkdir /tmp"

Note: the command above will fail gracefully if the /tmp directory already exists.

# su - mapr -c "hadoop fs -chmod 777 /tmp"
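rmr2 submits jobs through Hadoop Streaming, which runs the map and reduce logic as processes that read records on stdin and write key/value pairs on stdout, with a sort (the shuffle) in between. That data flow can be sketched locally with ordinary Unix pipes; this illustrates only the streaming model, not an actual rmr2 invocation:

```shell
# Simulate the streaming wordcount data flow locally:
#   map    -> emit one word (key) per line
#   sort   -> the shuffle, bringing identical keys together
#   reduce -> count each run of identical keys
printf 'big data big cluster\n' |
  tr ' ' '\n' |
  sort |
  uniq -c |
  awk '{print $2, $1}'
# prints:
#   big 2
#   cluster 1
#   data 1
```

In a real streaming job, Hadoop runs the map and reduce stages as separate processes across the task trackers and performs the sort itself.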
Validate rmr2 on any single client system as a non-root user with the following steps. Note that the rmr2 package must be installed on ALL the task trackers before proceeding.

1) Confirm that your MapR cluster is configured to run a simple MapReduce job outside of the RHadoop environment. You only need to perform this step on one client from which you intend to run your R MapReduce programs.

# su - user01 -c "hadoop fs -mkdir /tmp/mrtest/wc-in"
# su - user01 -c "hadoop fs -put /opt/mapr/notice.txt /tmp/mrtest/wc-in"
# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop /hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"
# su - user01 -c "hadoop fs -cat /tmp/mrtest/wc-out/part-r-00000"
# su - user01 -c "hadoop fs -rmr /tmp/mrtest"

2) Copy the wordcount.r script to a directory that is accessible by your non-root user.

# cp ~/rmr2/pkg/tests/wordcount.r /tmp

3) Set the required environment variables HADOOP_CMD and HADOOP_STREAMING. Since rmr2 uses Hadoop Streaming, it needs access to both the hadoop command and the streaming jar.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop /bin/hadoop
# export HADOOP_STREAMING=/opt/mapr/hadoop/hadoop-\
/contrib/streaming/hadoop dev-streaming.jar

4) Switch user to your non-root user to validate the installation was successful.

# su user01

5) Run the wordcount.r program from the R environment as your non-root user.

$ R --no-save < /tmp/wordcount.r; echo $?

Note that an exit code of 0 means the command was successful.

6) Exit from your su command.

$ exit

7) Run the full rmr2 check. The examples run by the rmr2 check below will sequentially generate 81 streaming MapReduce jobs on the cluster. 80 of the jobs have just 2 mappers, so a large cluster will not speed this up. On a 3-node medium EC2 cluster with two task trackers, the examples take just over 1 hour. You may wish to launch this under nohup as shown below and wait for it to complete before you proceed in this document.
# nohup R CMD check ~/rmr2/pkg > ~/rmr2-check.out &
Check the output in ~/rmr2-check.out for any errors.

8) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below will set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" \
>> /etc/profile
# echo "export HADOOP_CONF=/opt/mapr/hadoop/hadoop /conf" \
>> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop /bin/hadoop" \
>> /etc/profile

Install rhbase

The rhbase package accesses HBase via the HBase Thrift server, which is included in the MapR HBase distribution. The rhbase package is a Thrift client that sends requests to and receives responses from the Thrift server. The Thrift server listens for Thrift requests and in turn uses the HBase HTable Java class to access HBase. Since rhbase is a client-side technology, it only needs to be installed on the client system that will access the MapR HBase cluster. Any MapR HBase cluster node can also be a client.

For the client system to access a local Thrift server, the client system must have the mapr-hbase-internal package installed, which includes the MapR HBase Thrift server. If your client system is one of the MapR HBase Masters or Region Servers, it will already have this package installed. These rhbase installation instructions assume that mapr-hbase-internal is already installed on the client system.

In addition to the HBase Thrift server, the rhbase package requires the Thrift include files to compile and the C++ Thrift library at runtime in order to be a Thrift client. Since these Thrift components are not included in the MapR distribution, Thrift must be installed before rhbase. By default, rhbase connects to a Thrift server on the local host.
A remote server can be specified in the rhbase hb.init() call, but the rhbase package check expects the Thrift server to be local. These installation instructions assume the Thrift server is running locally and that HBase is installed and running in your cluster. As the root user, perform the following installation steps on ALL task tracker nodes and client systems.

1) Install (or update) the prerequisite packages.

# yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel \
libevent-devel zlib-devel python-devel openssl-devel ruby-devel qt qt-devel \
php-devel

2) Download, build, and install Thrift.

# cd ~
# git clone thrift
# cd ~/thrift
# sed -i s/2.65/2.63/ configure.ac
# ./bootstrap.sh
# ./configure
# make && make install
# /sbin/ldconfig /usr/lib/libthrift so

3) Download the rhbase package.

# cd ~
# git clone git://github.com/revolutionanalytics/rhbase.git

4) Modify the thrift.pc file. We need to add the thrift directory to the end of the includedir configuration.

# sed -i \
's/^includedir=\${prefix}\/include/includedir=\${prefix}\/include\/thrift/' \
~/thrift/lib/cpp/thrift.pc

5) Install the rhbase package. LD_LIBRARY_PATH must be set to find the Thrift library (libthrift.so), which was installed as part of the Thrift installation. Also, PKG_CONFIG_PATH must point to the directory containing the thrift.pc package configuration file.

# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=~/thrift/lib/cpp
# R CMD INSTALL ~/rhbase/pkg

6) Configure the file /opt/mapr/hbase/hbase /conf/hbase-site.xml with the HBase zookeeper servers and their port number. For HBase Master Servers and HBase Region Servers, the zookeeper servers should already be properly configured in this file. For client-only systems, edit the hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort properties to correspond to your zookeeper servers.

# su - mapr -c "maprcli node listzookeepers"
# su - mapr -c "vi /opt/mapr/hbase/hbase /conf/hbase-site.xml"
...
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zkhost1,zkhost2,zkhost3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
</property>
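The includedir substitution applied to thrift.pc above can be rehearsed on a scratch copy before modifying the real file. The file contents below are made-up placeholders, not the actual thrift.pc shipped with Thrift:

```shell
# Create a minimal stand-in for thrift.pc (placeholder values only).
cat > /tmp/thrift-demo.pc <<'EOF'
prefix=/usr/local
includedir=${prefix}/include
Name: Thrift
EOF

# Apply the same substitution as above: append /thrift to includedir.
sed -i 's/^includedir=\${prefix}\/include/includedir=\${prefix}\/include\/thrift/' /tmp/thrift-demo.pc

grep '^includedir=' /tmp/thrift-demo.pc
# prints: includedir=${prefix}/include/thrift
```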
7) Start the MapR HBase Thrift server as a background daemon.

# /opt/mapr/hbase/hbase /bin/hbase-daemon.sh start thrift

Note: Pass the parameters stop thrift to the hbase-daemon.sh script to stop the daemon.

8) Run the rhbase package checks. LD_LIBRARY_PATH and PKG_CONFIG_PATH must still be set. Note that the command will produce errors you can safely ignore if you are not running Hadoop HBase in your MapR cluster. Recall that the rhbase package is compatible with MapR tables, which does not require Hadoop HBase to be installed and running in your MapR cluster.

# R CMD check ~/rhbase/pkg

Note: When running the rhbase package check, warnings can be safely ignored.

9) (optional) Persist the required environment variable settings for the shells in your environment. The commands below will set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export \
LD_LIBRARY_PATH=/usr/local/lib:/usr/lib64/R/library/rhbase/libs:\$\
LD_LIBRARY_PATH" >> /etc/profile
# echo "export PKG_CONFIG_PATH=~root/thrift/lib/cpp" >> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop /bin/hadoop" \
>> /etc/profile

Test rhbase with Hadoop HBase

The rhbase package is now installed and ready for use by any user on the system. Validate that a non-root user can access HBase via the HBase shell and from rhbase. Note that the instructions below assume the environment variables LD_LIBRARY_PATH and HADOOP_CMD have been configured for the user01 user. They also assume you are running Hadoop HBase in your MapR cluster.

1) Start the HBase shell, create a table, display its description, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create 'mytable', 'cf1'
hbase(main):002:0> describe 'mytable'
hbase(main):003:0> disable 'mytable'
hbase(main):004:0> drop 'mytable'
hbase(main):005:0>
quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('mytable', 'cf1')
> hb.describe.table('mytable')
> hb.disable.table('mytable')
> hb.delete.table('mytable')
> q()

Test rhbase with MapR Tables

The rhbase package is compatible with MapR tables. You can use the MapR tables feature if you have an M7 license installed on your MapR cluster. Simply use absolute paths in MapR-FS for your table names, rather than relative paths as for HBase. The following steps assume that the user01 home directory is in a MapR file system called /mapr/mycluster/home/user01.

1) Start the HBase shell, create a table, display its description, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create '/mapr/mycluster/home/user01/mytable', 'cf1'
hbase(main):002:0> describe '/mapr/mycluster/home/user01/mytable'
hbase(main):003:0> disable '/mapr/mycluster/home/user01/mytable'
hbase(main):004:0> drop '/mapr/mycluster/home/user01/mytable'
hbase(main):005:0> quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('/mapr/mycluster/home/user01/mytable', 'cf1')
> hb.describe.table('/mapr/mycluster/home/user01/mytable')
> hb.delete.table('/mapr/mycluster/home/user01/mytable')
> q()

Conclusion

With Revolution Analytics' RHadoop packages and MapR's enterprise-grade Hadoop distribution, data scientists can utilize the full potential of Hadoop from the familiar R environment.

Resources

More information can be found on RHadoop and the other technologies referenced in this paper at the links below.

GNU R home page: www.r-project.org
RHadoop home page:
Apache Thrift:
Revolution Analytics:
MapR Technologies:

About MapR Technologies

MapR's advanced distribution for Apache Hadoop delivers on the promise of Hadoop, making the management and analysis of big data a practical reality for more organizations. MapR's advanced capabilities, such as streaming analytics, mission-critical data protection, and MapR tables, expand the breadth and depth of use cases across industries.

About Revolution Analytics

Revolution Analytics (formerly Revolution Computing) was founded in 2007 to foster the R Community, as well as support the growing needs of commercial users. Our name derives from combining the letter "R" with the word "evolution." It speaks to the ongoing development of the R language from an open-source academic research tool into commercial applications for industrial use.
More informationEasily parallelize existing application with Hadoop framework Juan Lago, July 2011
Easily parallelize existing application with Hadoop framework Juan Lago, July 2011 There are three ways of installing Hadoop: Standalone (or local) mode: no deamons running. Nothing to configure after
More informationPackage hive. January 10, 2011
Package hive January 10, 2011 Version 0.1-9 Date 2011-01-09 Title Hadoop InteractiVE Description Hadoop InteractiVE, is an R extension facilitating distributed computing via the MapReduce paradigm. It
More informationHadoop Setup. 1 Cluster
In order to use HadoopUnit (described in Sect. 3.3.3), a Hadoop cluster needs to be setup. This cluster can be setup manually with physical machines in a local environment, or in the cloud. Creating a
More informationJobScheduler - Amazon AMI Installation
JobScheduler - Job Execution and Scheduling System JobScheduler - Amazon AMI Installation March 2015 March 2015 JobScheduler - Amazon AMI Installation page: 1 JobScheduler - Amazon AMI Installation - Contact
More informationConfiguring Informatica Data Vault to Work with Cloudera Hadoop Cluster
Configuring Informatica Data Vault to Work with Cloudera Hadoop Cluster 2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,
More informationHow to Run Spark Application
How to Run Spark Application Junghoon Kang Contents 1 Intro 2 2 How to Install Spark on a Local Machine? 2 2.1 On Ubuntu 14.04.................................... 2 3 How to Run Spark Application on a
More informationSource Code Management for Continuous Integration and Deployment. Version 1.0 DO NOT DISTRIBUTE
Source Code Management for Continuous Integration and Deployment Version 1.0 Copyright 2013, 2014 Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be reproduced or redistributed,
More informationExtreme computing lab exercises Session one
Extreme computing lab exercises Session one Michail Basios (m.basios@sms.ed.ac.uk) Stratis Viglas (sviglas@inf.ed.ac.uk) 1 Getting started First you need to access the machine where you will be doing all
More informationPivotal HD Enterprise
PRODUCT DOCUMENTATION Pivotal HD Enterprise Version 1.1 Stack and Tool Reference Guide Rev: A01 2013 GoPivotal, Inc. Table of Contents 1 Pivotal HD 1.1 Stack - RPM Package 11 1.1 Overview 11 1.2 Accessing
More informationIDS 561 Big data analytics Assignment 1
IDS 561 Big data analytics Assignment 1 Due Midnight, October 4th, 2015 General Instructions The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you acquainted with the code
More informationCSE-E5430 Scalable Cloud Computing. Lecture 4
Lecture 4 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 5.10-2015 1/23 Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System
More informationLecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
More informationInstall BA Server with Your Own BA Repository
Install BA Server with Your Own BA Repository This document supports Pentaho Business Analytics Suite 5.0 GA and Pentaho Data Integration 5.0 GA, documentation revision February 3, 2014, copyright 2014
More informationMarkLogic Server. MarkLogic Connector for Hadoop Developer s Guide. MarkLogic 8 February, 2015
MarkLogic Connector for Hadoop Developer s Guide 1 MarkLogic 8 February, 2015 Last Revised: 8.0-3, June, 2015 Copyright 2015 MarkLogic Corporation. All rights reserved. Table of Contents Table of Contents
More informationHDFS Cluster Installation Automation for TupleWare
HDFS Cluster Installation Automation for TupleWare Xinyi Lu Department of Computer Science Brown University Providence, RI 02912 xinyi_lu@brown.edu March 26, 2014 Abstract TupleWare[1] is a C++ Framework
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationSymantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0
Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0 The software described in this book is furnished under a license agreement and may be used only in accordance with the
More informationDeploying MongoDB and Hadoop to Amazon Web Services
SGT WHITE PAPER Deploying MongoDB and Hadoop to Amazon Web Services HCCP Big Data Lab 2015 SGT, Inc. All Rights Reserved 7701 Greenbelt Road, Suite 400, Greenbelt, MD 20770 Tel: (301) 614-8600 Fax: (301)
More informationHadoop 2.6.0 Setup Walkthrough
Hadoop 2.6.0 Setup Walkthrough This document provides information about working with Hadoop 2.6.0. 1 Setting Up Configuration Files... 2 2 Setting Up The Environment... 2 3 Additional Notes... 3 4 Selecting
More informationITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
More informationDeploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters
CONNECT - Lab Guide Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters Hardware, software and configuration steps needed to deploy Apache Hadoop 2.4.1 with the Emulex family
More informationIntegrating Apache Web Server with Tomcat Application Server
Integrating Apache Web Server with Tomcat Application Server The following document describes how to build an Apache/Tomcat server from all source code. The end goal of this document is to configure the
More informationHadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay
Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay Dipojjwal Ray Sandeep Prasad 1 Introduction In installation manual we listed out the steps for hadoop-1.0.3 and hadoop-
More informationCONFIGURING ECLIPSE FOR AWS EMR DEVELOPMENT
CONFIGURING ECLIPSE FOR AWS EMR DEVELOPMENT With this post we thought of sharing a tutorial for configuring Eclipse IDE (Intergrated Development Environment) for Amazon AWS EMR scripting and development.
More informationContents Set up Cassandra Cluster using Datastax Community Edition on Amazon EC2 Installing OpsCenter on Amazon AMI References Contact
Contents Set up Cassandra Cluster using Datastax Community Edition on Amazon EC2... 2 Launce Amazon micro-instances... 2 Install JDK 7... 7 Install Cassandra... 8 Configure cassandra.yaml file... 8 Start
More informationBasic Hadoop Programming Skills
Basic Hadoop Programming Skills Basic commands of Ubuntu Open file explorer Basic commands of Ubuntu Open terminal Basic commands of Ubuntu Open new tabs in terminal Typically, one tab for compiling source
More informationINSTALLING KAAZING WEBSOCKET GATEWAY - HTML5 EDITION ON AN AMAZON EC2 CLOUD SERVER
INSTALLING KAAZING WEBSOCKET GATEWAY - HTML5 EDITION ON AN AMAZON EC2 CLOUD SERVER A TECHNICAL WHITEPAPER Copyright 2012 Kaazing Corporation. All rights reserved. kaazing.com Executive Overview This document
More informationIUCLID 5 Guidance and support. Installation Guide Distributed Version. Linux - Apache Tomcat - PostgreSQL
IUCLID 5 Guidance and support Installation Guide Distributed Version Linux - Apache Tomcat - PostgreSQL June 2009 Legal Notice Neither the European Chemicals Agency nor any person acting on behalf of the
More informationTutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
More informationInstallation Guide Setting Up and Testing Hadoop on Mac By Ryan Tabora, Think Big Analytics
Installation Guide Setting Up and Testing Hadoop on Mac By Ryan Tabora, Think Big Analytics www.thinkbiganalytics.com 520 San Antonio Rd, Suite 210 Mt. View, CA 94040 (650) 949-2350 Table of Contents OVERVIEW
More informationHadoop Basics with InfoSphere BigInsights
An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Part: 1 Exploring Hadoop Distributed File System An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government
More informationОбработка больших данных: Map Reduce (Python) + Hadoop (Streaming) Максим Щербаков ВолгГТУ 8/10/2014
Обработка больших данных: Map Reduce (Python) + Hadoop (Streaming) Максим Щербаков ВолгГТУ 8/10/2014 1 Содержание Бигдайта: распределенные вычисления и тренды MapReduce: концепция и примеры реализации
More informationFile S1: Supplementary Information of CloudDOE
File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.
More informationInstallation Guide. McAfee VirusScan Enterprise for Linux 1.9.0 Software
Installation Guide McAfee VirusScan Enterprise for Linux 1.9.0 Software COPYRIGHT Copyright 2013 McAfee, Inc. Do not copy without permission. TRADEMARK ATTRIBUTIONS McAfee, the McAfee logo, McAfee Active
More informationdocs.hortonworks.com
docs.hortonworks.com : Security Administration Tools Guide Copyright 2012-2014 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform
More informationHOD Scheduler. Table of contents
Table of contents 1 Introduction... 2 2 HOD Users... 2 2.1 Getting Started... 2 2.2 HOD Features...5 2.3 Troubleshooting... 14 3 HOD Administrators... 21 3.1 Getting Started... 22 3.2 Prerequisites...
More informationHadoop Tutorial. General Instructions
CS246: Mining Massive Datasets Winter 2016 Hadoop Tutorial Due 11:59pm January 12, 2016 General Instructions The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you acquainted
More informationGetting Started with Amazon EC2 Management in Eclipse
Getting Started with Amazon EC2 Management in Eclipse Table of Contents Introduction... 4 Installation... 4 Prerequisites... 4 Installing the AWS Toolkit for Eclipse... 4 Retrieving your AWS Credentials...
More informationInstalling Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.
Big Data Computing Instructor: Prof. Irene Finocchi Master's Degree in Computer Science Academic Year 2013-2014, spring semester Installing Hadoop Emanuele Fusco (fusco@di.uniroma1.it) Prerequisites You
More informationInstallation and Configuration Documentation
Installation and Configuration Documentation Release 1.0.1 Oshin Prem October 08, 2015 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE
More informationPeers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
More informationIntegrating VoltDB with Hadoop
The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.
More informationOLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)
Use Data from a Hadoop Cluster with Oracle Database Hands-On Lab Lab Structure Acronyms: OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) All files are
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationInstallation of PHP, MariaDB, and Apache
Installation of PHP, MariaDB, and Apache A few years ago, one would have had to walk over to the closest pizza store to order a pizza, go over to the bank to transfer money from one account to another
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationHADOOP - MULTI NODE CLUSTER
HADOOP - MULTI NODE CLUSTER http://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm Copyright tutorialspoint.com This chapter explains the setup of the Hadoop Multi-Node cluster on a distributed
More informationInstalling and Configuring MySQL as StoreGrid Backend Database on Linux
Installing and Configuring MySQL as StoreGrid Backend Database on Linux Overview StoreGrid now supports MySQL as a backend database to store all the clients' backup metadata information. Unlike StoreGrid
More informationSpring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE
Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working
More informationPaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012
PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping Version 1.0, Oct 2012 This document describes PaRFR, a Java package that implements a parallel random
More informationA Study of Data Management Technology for Handling Big Data
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,
More informationSAP Predictive Analysis Installation
SAP Predictive Analysis Installation SAP Predictive Analysis is the latest addition to the SAP BusinessObjects suite and introduces entirely new functionality to the existing Business Objects toolbox.
More informationALERT installation setup
ALERT installation setup In order to automate the installation process of the ALERT system, the ALERT installation setup is developed. It represents the main starting point in installing the ALERT system.
More informationINF-110. GPFS Installation
INF-110 GPFS Installation Overview Plan the installation Before installing any software, it is important to plan the GPFS installation by choosing the hardware, deciding which kind of disk connectivity
More information