Install Hadoop on Ubuntu and run as standalone




Welcome. This document is a record of my Hadoop installation, kept for study purposes.

Version history:

1.0  Dec 2013  Initial Hadoop study: install the basic environment, run the first word count program.
2.0  May 2014  Extend Hadoop to master + slave (1 master + 2 slave nodes) on RHEL; run word count on a bigger file and look for improvement; study Hadoop-based projects (Hive, Pig, HBase); install HBase on Hadoop in fully distributed mode.

Install Hadoop on Ubuntu and run as standalone

Hadoop version: 2.2.0
OS: Ubuntu 32 bit

hduser@ubuntu:~$ cat /etc/issue
Ubuntu 12.10 \n \l

Host: VMware Workstation on Windows 8

Please be advised that this is a draft and just fits my own circumstances.

References:

http://blog.csdn.net/focusheart/article/details/14005893
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://cs.smith.edu/dftwiki/index.php/hadoop_tutorial_1_--_running_wordcount
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/CommandsManual.html
http://www.osedu.net/article/nosql/2012-05-02/435.html
http://www.slideshare.net/waue/hadoop-map-reduce-3019713

!!To-do list!! The following things need to be verified:

1. Job tracker interface can't be shown -- fixed.
2. Where is the job tracker? It seems the job tracker was changed to YARN (next-generation MapReduce). Use localhost:8088 instead of localhost:50030.

http://docs.aws.amazon.com/elasticmapreduce/latest/developerguide/emr-hadoop-2.2.0-features.html
https://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-common/ClusterSetup.html

Web Interfaces

Once the Hadoop cluster is up and running, check the web UIs of the components as described below:

NameNode                     http://nn_host:port/
ResourceManager              http://rm_host:port/
MapReduce JobHistory Server  http://jhs_host:port/

Apache Hadoop NextGen MapReduce (YARN)

MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. So when checking with jps, there is no longer a JobTracker.

http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-users/

Web UI

In MR1, the JobTracker web UI served detailed information about the state of the cluster and the jobs currently and recently running on it. It also contained the job history page, which served information from disk about older jobs.

The MR2 web UI provides the same information structured in the same way, but has been revamped with a new look and feel. The ResourceManager UI, which includes information about running applications and the state of the cluster, is now located by default at port 8088. The job history UI is now located by default at port 19888. Jobs can be searched and viewed there just as they could in MR1.

Because the ResourceManager is meant to be agnostic to many of the concepts in MapReduce, it cannot host job information directly. Instead, it proxies to a web UI that can. If the job is running, this is the relevant MapReduce Application Master; if it has completed, this is the JobHistoryServer. In this sense, the user experience is similar to that of MR1, but the information is coming from different places.

Some concepts

What is Apache Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.

Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
ZooKeeper: A high-performance coordination service for distributed applications.

Hadoop follows the idea of map-and-reduce jobs, and it has a DFS file system to support distribution.

Important ports

1. Job Tracker: 50030 (no longer in 2.2.0?)
2. HDFS: 50070
3. HDFS communication: 9000
4. MapReduce communication: 9001

Management

1. HDFS web interface: http://hostname:50070
2. MapReduce interface: http://hostname:50030

Install Ubuntu

You may like to install it from a flash disk (on a new tower).

Install required packages

For example, a JRE/JDK:

$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk

Add user and user group

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

This will add the user hduser and the group hadoop to your local machine.

Configuring SSH

user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

If this succeeds, you should be able to ssh to your host without entering credentials.

Disable IPv6 (not proven whether necessary)

Add the following lines to /etc/sysctl.conf:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect. You can check whether IPv6 is enabled on your machine with the following command (a value of 1 means IPv6 is disabled):

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
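One step the session above does not show: the new public key must also be authorized for login before passwordless ssh works. A minimal sketch, following the single-node tutorial referenced at the top:

hduser@ubuntu:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hduser@ubuntu:~$ ssh localhost

The first ssh will ask you to confirm the host fingerprint; after that it should log in without prompting for a password.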

Install Hadoop files

$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop

(Note: the tar commands above come from the referenced tutorial and name hadoop-1.0.3; for this install the tarball is hadoop-2.2.0.tar.gz, so adjust the names accordingly.)

You may also want to make a directory where DFS data files are placed; in my case it is /hadoopfs. If necessary, grant ownership to hduser:hadoop.

Update environment variables

vim /home/hduser/.bashrc

export HADOOP_HOME=/usr/local/hadoop

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

export PATH=$PATH:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/default-java
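To pick up the new variables without logging out, reload the file and sanity-check the result (hadoop version only works once the Hadoop files from the step above are in place):

hduser@ubuntu:~$ source ~/.bashrc
hduser@ubuntu:~$ hadoop version
hduser@ubuntu:~$ echo $JAVA_HOME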

Configure site

Modify core-site.xml:

hduser@ubuntu:/usr/local/hadoop/etc/hadoop$ pwd
/usr/local/hadoop/etc/hadoop

Add the following lines:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoopfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The URI's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The URI's authority is used to determine the host,
    port, etc. for a filesystem.</description>
    <final>true</final>
  </property>
</configuration>

Configure HDFS

Modify hdfs-site.xml. This file provides information for DFS.

hduser@ubuntu:/usr/local/hadoop/etc/hadoop$ cat hdfs-site.xml

<?xml version="1.0" encoding="utf-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/license-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. See accompanying LICENSE file. --> <!-- Put site-specific property overrides in this file. --> <configuration> <name>dfs.replication</name> <value>1</value> <description>default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> <name>dfs.permissions</name>

    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/hadoopfs/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/hadoopfs/dfs/data</value>
    <final>true</final>
  </property>
</configuration>

Configure MapReduce

hduser@ubuntu:/usr/local/hadoop/etc/hadoop$ cat mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.system.dir</name>
    <value>file:/hadoopfs/dfs/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
  </property>

  <property>
    <name>mapreduce.jobtracker.http.address</name>
    <value>localhost:50030</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>file:/hadoopfs/dfs/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>file:/hadoopfs/dfs/tmp</value>
    <description>No description</description>
    <final>true</final>
  </property>
</configuration>

yarn-site.xml

<?xml version="1.0"?>
<configuration>

<!-- Site specific YARN configuration properties -->

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
    <description>The host is the hostname of the ResourceManager and the
    port is the port on which the clients can talk to the Resource
    Manager.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    <description>Shuffle service that needs to be set for MapReduce to
    run.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>

Initialize DFS

$ hdfs namenode -format

In some cases, if there is a problem with the DataNode and NameNode DFS, you need to re-format the DFS:

$ rm -rf /hadoopfs/dfs/*
$ rm -rf /hadoopfs/tmp/*

Then re-format the NameNode DFS.

Start and stop the server(s)

They should be started one by one: NameNode, DataNode, and YARN (ResourceManager), but you can use start-all.sh and stop-all.sh to start/stop everything at once. The scripts are in:

/usr/local/hadoop/sbin

$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode

Monitor the node(s)

General node information can be checked with a web browser at port 50070 (by default).

You can also use jps to see that the following instances are running:

hduser@ubuntu:/usr/local/hadoop/sbin$ jps
3135 ResourceManager
2678 DataNode
2428 NameNode
5044 Jps
2961 SecondaryNameNode

Run a test

Let's try this word count example. Don't run the randomwriter example with default settings unless you want to waste 10 GB:

http://wiki.apache.org/hadoop/RandomWriter

The RandomWriter example writes 10 GB (by default) of random data per host to DFS using MapReduce.

Try to find some text, such as this cool stuff:

$ wget http://www.gutenberg.org/files/4300/4300.zip
$ unzip 4300.zip

hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -mkdir /tmp
hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -copyFromLocal 4300.txt /tmp

Now ready to roll!

hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop jar hadoop-mapreduce-examples-2.1.0-beta.jar wordcount /tmp/4300.txt /tmp/output3
14/01/28 20:04:56 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
14/01/28 20:04:56 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/01/28 20:04:56 INFO input.FileInputFormat: Total input paths to process : 1
14/01/28 20:04:56 INFO mapreduce.JobSubmitter: number of splits:1
14/01/28 20:04:56 WARN conf.Configuration: user.name is deprecated. Instead, use mapreduce.job.user.name
14/01/28 20:04:56 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/01/28 20:04:56 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/01/28 20:04:56 WARN conf.Configuration: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/01/28 20:04:56 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/01/28 20:04:56 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/01/28 20:04:56 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class

14/01/28 20:04:56 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/01/28 20:04:56 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/01/28 20:04:56 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/01/28 20:04:56 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/01/28 20:04:56 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/01/28 20:04:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local513207422_0001
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/staging/hduser513207422/.staging/job_local513207422_0001/job.xml:an attempt to override final parameter: mapreduce.local.dir; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/staging/hduser513207422/.staging/job_local513207422_0001/job.xml:an attempt to override final parameter: dfs.namenode.name.dir; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/staging/hduser513207422/.staging/job_local513207422_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/staging/hduser513207422/.staging/job_local513207422_0001/job.xml:an attempt to override final parameter: dfs.datanode.data.dir; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/staging/hduser513207422/.staging/job_local513207422_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/staging/hduser513207422/.staging/job_local513207422_0001/job.xml:an attempt to override final parameter: fs.defaultFS; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/staging/hduser513207422/.staging/job_local513207422_0001/job.xml:an attempt to override final parameter: mapreduce.system.dir; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/local/localRunner/job_local513207422_0001.xml:an attempt to override final parameter: mapreduce.local.dir; Ignoring.

14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/local/localRunner/job_local513207422_0001.xml:an attempt to override final parameter: dfs.namenode.name.dir; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/local/localRunner/job_local513207422_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/local/localRunner/job_local513207422_0001.xml:an attempt to override final parameter: dfs.datanode.data.dir; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/local/localRunner/job_local513207422_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/local/localRunner/job_local513207422_0001.xml:an attempt to override final parameter: fs.defaultFS; Ignoring.
14/01/28 20:04:56 WARN conf.Configuration: file:/hadoopfs/tmp/mapred/local/localRunner/job_local513207422_0001.xml:an attempt to override final parameter: mapreduce.system.dir; Ignoring.
14/01/28 20:04:56 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/01/28 20:04:56 INFO mapreduce.Job: Running job: job_local513207422_0001
14/01/28 20:04:56 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/01/28 20:04:56 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/01/28 20:04:56 INFO mapred.LocalJobRunner: Waiting for map tasks
14/01/28 20:04:56 INFO mapred.LocalJobRunner: Starting task: attempt_local513207422_0001_m_000000_0
14/01/28 20:04:56 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/01/28 20:04:56 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/tmp/4300.txt:0+1573078
14/01/28 20:04:57 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/01/28 20:04:57 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/01/28 20:04:57 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/01/28 20:04:57 INFO mapred.MapTask: soft limit at 83886080

14/01/28 20:04:57 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/01/28 20:04:57 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/01/28 20:04:57 INFO mapred.LocalJobRunner:
14/01/28 20:04:57 INFO mapred.MapTask: Starting flush of map output
14/01/28 20:04:57 INFO mapred.MapTask: Spilling map output
14/01/28 20:04:57 INFO mapred.MapTask: bufstart = 0; bufend = 2601826; bufvoid = 104857600
14/01/28 20:04:57 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25142480(100569920); length = 1071917/6553600
14/01/28 20:04:57 INFO mapreduce.Job: Job job_local513207422_0001 running in uber mode : false
14/01/28 20:04:57 INFO mapreduce.Job: map 0% reduce 0%
14/01/28 20:04:58 INFO mapred.MapTask: Finished spill 0
14/01/28 20:04:58 INFO mapred.Task: Task:attempt_local513207422_0001_m_000000_0 is done. And is in the process of committing
14/01/28 20:04:58 INFO mapred.LocalJobRunner: map
14/01/28 20:04:58 INFO mapred.Task: Task 'attempt_local513207422_0001_m_000000_0' done.
14/01/28 20:04:58 INFO mapred.LocalJobRunner: Finishing task: attempt_local513207422_0001_m_000000_0
14/01/28 20:04:58 INFO mapred.LocalJobRunner: Map task executor complete.
14/01/28 20:04:58 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/01/28 20:04:58 INFO mapred.Merger: Merging 1 sorted segments
14/01/28 20:04:58 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 725062 bytes
14/01/28 20:04:58 INFO mapred.LocalJobRunner:
14/01/28 20:04:58 WARN conf.Configuration: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
14/01/28 20:04:58 INFO mapreduce.Job: map 100% reduce 0%
14/01/28 20:04:58 INFO mapred.Task: Task:attempt_local513207422_0001_r_000000_0 is done. And is in the process of committing
14/01/28 20:04:58 INFO mapred.LocalJobRunner:

14/01/28 20:04:58 INFO mapred.Task: Task attempt_local513207422_0001_r_000000_0 is allowed to commit now
14/01/28 20:04:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_local513207422_0001_r_000000_0' to hdfs://localhost:9000/tmp/output3/_temporary/0/task_local513207422_0001_r_000000
14/01/28 20:04:58 INFO mapred.LocalJobRunner: reduce > reduce
14/01/28 20:04:58 INFO mapred.Task: Task 'attempt_local513207422_0001_r_000000_0' done.
14/01/28 20:04:59 INFO mapreduce.Job: map 100% reduce 100%
14/01/28 20:04:59 INFO mapreduce.Job: Job job_local513207422_0001 completed successfully
14/01/28 20:04:59 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=1268626
FILE: Number of bytes written=2363894
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3146156
HDFS: Number of bytes written=527555
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=33056
Map output records=267980
Map output bytes=2601826
Map output materialized bytes=725074
Input split bytes=99
Combine input records=267980
Combine output records=50095

Reduce input groups=50095
Reduce shuffle bytes=0
Reduce input records=50095
Reduce output records=50095
Spilled Records=100190
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=37
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=508559360
File Input Format Counters
Bytes Read=1573078
File Output Format Counters
Bytes Written=527555
hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$

See the result with:

hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -cat /tmp/output3/part*

Or from the web console.
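A handy alternative, used later in this document, is to merge the whole output directory into one local file:

hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hdfs dfs -getmerge /tmp/output3 ./wordcount.result
hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ head wordcount.result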

Remove a directory

hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -rmr /tmp/output
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
rmr: DEPRECATED: Please use 'rm -r' instead.
14/01/30 09:26:27 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/output
hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -rmr /tmp/output2
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

rmr: DEPRECATED: Please use 'rm -r' instead.
14/01/30 09:26:56 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/output2

Install Hadoop as Master + Slave on RHEL 6.5

General steps

1. Preparation

   Hostname:     master: master.hadoop.advol; slave: node.hadoop.advol (edit /etc/hosts)
   JDK:          /usr/java/jdk-1.7.0 on both (download the rpm from Oracle)
   $HADOOP_HOME: /data/hadoop/hadoop-2.4.0 on both (a separate logical volume can be used for HDFS)
   User:         hadoop/hadoop on both (adduser hadoop; also grant ownership of $HADOOP_HOME to hadoop)
   Hardware:     master: standalone hardware, RHEL 6.5; slave: RHEL 6.5 on VMware on my laptop

2. Do the same on the master node and the slave nodes to install the JDK and configure the environment.
3. Modify hostnames to enable pinging each other by hostname.
4. Modify the firewall (iptables).
5. Add the SSH public key from master to slave, as well as from slave to master (see the sketch after this section).
6. Keep the directory layout the same on all nodes.
7. Modify the slaves file in $HADOOP_HOME/etc (see the sketch after this section).
8. Format the NameNode again.
9. Run start-all from the master.

Hadoop-env.sh

Add JAVA_HOME:

# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.7.0_55
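A minimal sketch of steps 3, 5, and 7 above, using the hostnames from the preparation table (the IP addresses are the ones that appear in the dfsadmin report later in this document; adjust to your network):

On both machines, /etc/hosts contains:

130.113.68.234  master.hadoop.advol
130.113.68.191  node.hadoop.advol

As the hadoop user, exchange SSH keys in both directions:

[hadoop@master ~]$ ssh-keygen -t rsa -P ""
[hadoop@master ~]$ ssh-copy-id hadoop@node.hadoop.advol

(and repeat from the slave toward the master)

On the master, list all nodes that should run a DataNode in $HADOOP_HOME/etc/hadoop/slaves, one hostname per line:

master.hadoop.advol
node.hadoop.advol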

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/hadoopfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master.hadoop.advol:9000</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/name</value>

    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>

    <value>file:/data/hadoop/hadoopfs/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/tmp</value>
    <description>No description</description>
    <final>true</final>
  </property>
</configuration>
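Step 6 above (keep the directories the same) implies copying the configuration to the slave after every change; one way to do that, assuming the paths from the preparation table:

[hadoop@master ~]$ rsync -av /data/hadoop/hadoop-2.4.0/etc/hadoop/ hadoop@node.hadoop.advol:/data/hadoop/hadoop-2.4.0/etc/hadoop/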

After starting ($HADOOP_HOME/sbin/start-all.sh), use jps to check the Java threads. At the master:

[hadoop@master sbin]$ jps
15673 ResourceManager
15782 NodeManager
16101 Jps
15519 SecondaryNameNode
15145 NameNode
[hadoop@master sbin]$

jps at the slave:

[hadoop@node ~]$ jps
3934 Jps
3480 DataNode
3583 NodeManager
[hadoop@node ~]$

If you have an issue with copyFromLocal (only one node accepts, while the other raises an I/O error), check the firewall.

Check HDFS:

[hadoop@master hadoop]$ hdfs dfsadmin -report
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/09 17:38:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Configured Capacity: 180645183488 (168.24 GB)
Present Capacity: 159183417344 (148.25 GB)
DFS Remaining: 159180152832 (148.25 GB)
DFS Used: 3264512 (3.11 MB)
DFS Used%: 0.00%

Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Live datanodes:
Name: 130.113.68.191:50010 (node.hadoop.advol)
Hostname: node.hadoop.advol
Decommission Status : Normal
Configured Capacity: 18713219072 (17.43 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 4541681664 (4.23 GB)
DFS Remaining: 14171508736 (13.20 GB)
DFS Used%: 0.00%
DFS Remaining%: 75.73%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Fri May 09 17:38:43 EDT 2014

Name: 130.113.68.234:50010 (master.hadoop.advol)
Hostname: master.hadoop.advol
Decommission Status : Normal
Configured Capacity: 161931964416 (150.81 GB)
DFS Used: 3235840 (3.09 MB)

Non DFS Used: 16920084480 (15.76 GB)
DFS Remaining: 145008644096 (135.05 GB)
DFS Used%: 0.00%
DFS Remaining%: 89.55%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Fri May 09 17:38:43 EDT 2014

[hadoop@master hadoop]$

Run word count on 1 master + 2 slave nodes

You can also check the application job tracker at the master server: http://130.113.68.234:8088/cluster (localhost is not accepted).

[hadoop@master ~]$ hadoop jar /data/hadoop/hadoop-2.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /tmp/4300.txt /tmp/output4
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/12 13:04:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/12 13:04:14 INFO client.RMProxy: Connecting to ResourceManager at master.hadoop.advol/130.113.68.234:8032
14/05/12 13:04:15 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 13:04:15 INFO mapreduce.JobSubmitter: number of splits:1

14/05/12 13:04:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1399671198293_0001
14/05/12 13:04:16 INFO impl.YarnClientImpl: Submitted application application_1399671198293_0001
14/05/12 13:04:16 INFO mapreduce.Job: The url to track the job: http://master.hadoop.advol:8088/proxy/application_1399671198293_0001/
14/05/12 13:04:16 INFO mapreduce.Job: Running job: job_1399671198293_0001
14/05/12 13:04:26 INFO mapreduce.Job: Job job_1399671198293_0001 running in uber mode : false
14/05/12 13:04:26 INFO mapreduce.Job: map 0% reduce 0%
14/05/12 13:04:35 INFO mapreduce.Job: map 100% reduce 0%
14/05/12 13:04:41 INFO mapreduce.Job: map 100% reduce 100%
14/05/12 13:04:41 INFO mapreduce.Job: Job job_1399671198293_0001 completed successfully
14/05/12 13:04:42 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=725074
FILE: Number of bytes written=1635839
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1573187
HDFS: Number of bytes written=527555
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6084

Total time spent by all reduces in occupied slots (ms)=3660
Total time spent by all map tasks (ms)=6084
Total time spent by all reduce tasks (ms)=3660
Total vcore-seconds taken by all map tasks=6084
Total vcore-seconds taken by all reduce tasks=3660
Total megabyte-seconds taken by all map tasks=6230016
Total megabyte-seconds taken by all reduce tasks=3747840
Map-Reduce Framework
Map input records=33056
Map output records=267980
Map output bytes=2601826
Map output materialized bytes=725074
Input split bytes=109
Combine input records=267980
Combine output records=50095
Reduce input groups=50095
Reduce shuffle bytes=725074
Reduce input records=50095
Reduce output records=50095
Spilled Records=100190
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=286
CPU time spent (ms)=4990
Physical memory (bytes) snapshot=436609024
Virtual memory (bytes) snapshot=1763209216
Total committed heap usage (bytes)=274726912
Shuffle Errors
BAD_ID=0

CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1573078
File Output Format Counters
Bytes Written=527555
[hadoop@master ~]$

Run word count on master+slave (1 node)

[hadoop@master sbin]$ hadoop jar /data/hadoop/hadoop-2.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /tmp/4300.txt /tmp/output5
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/12 13:29:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/12 13:29:25 INFO client.RMProxy: Connecting to ResourceManager at master.hadoop.advol/130.113.68.234:8032
14/05/12 13:29:26 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 13:29:26 INFO mapreduce.JobSubmitter: number of splits:1
14/05/12 13:29:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1399915743300_0001
14/05/12 13:29:27 INFO impl.YarnClientImpl: Submitted application application_1399915743300_0001

14/05/12 13:29:27 INFO mapreduce.Job: The url to track the job: http://master.hadoop.advol:8088/proxy/application_1399915743300_0001/
14/05/12 13:29:27 INFO mapreduce.Job: Running job: job_1399915743300_0001
14/05/12 13:29:34 INFO mapreduce.Job: Job job_1399915743300_0001 running in uber mode : false
14/05/12 13:29:34 INFO mapreduce.Job: map 0% reduce 0%
14/05/12 13:29:41 INFO mapreduce.Job: map 100% reduce 0%
14/05/12 13:29:47 INFO mapreduce.Job: map 100% reduce 100%
14/05/12 13:29:48 INFO mapreduce.Job: Job job_1399915743300_0001 completed successfully
14/05/12 13:29:48 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=725074
FILE: Number of bytes written=1635839
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1573187
HDFS: Number of bytes written=527555
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4254
Total time spent by all reduces in occupied slots (ms)=3904
Total time spent by all map tasks (ms)=4254
Total time spent by all reduce tasks (ms)=3904
Total vcore-seconds taken by all map tasks=4254

Total vcore-seconds taken by all reduce tasks=3904
Total megabyte-seconds taken by all map tasks=4356096
Total megabyte-seconds taken by all reduce tasks=3997696
Map-Reduce Framework
Map input records=33056
Map output records=267980
Map output bytes=2601826
Map output materialized bytes=725074
Input split bytes=109
Combine input records=267980
Combine output records=50095
Reduce input groups=50095
Reduce shuffle bytes=725074
Reduce input records=50095
Reduce output records=50095
Spilled Records=100190
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=119
CPU time spent (ms)=5290
Physical memory (bytes) snapshot=442953728
Virtual memory (bytes) snapshot=1764282368
Total committed heap usage (bytes)=305135616
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0

WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1573078
File Output Format Counters
Bytes Written=527555
[hadoop@master sbin]$

Observation: I don't see any improvement between running standalone and running master+slave. (Note that the logs show "number of splits:1": the whole input fits in one split, so only one map task is launched no matter how many nodes are available.)

Run word count on the Bible on standalone and cluster

Try to work on a bigger file:

[hadoop@master hadoop]$ wget http://printkjv.ifbweb.com/av_txt.zip
[hadoop@master hadoop]$ unzip AV_txt.zip
[hadoop@master hadoop]$ mv AV1611Bible.txt bible.txt
[hadoop@master hadoop]$ hdfs dfs -copyFromLocal bible.txt /tmp/bible.txt
[hadoop@master hadoop]$ hadoop jar /data/hadoop/hadoop-2.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /tmp/bible.txt /tmp/bible/
[hadoop@master hadoop]$ hdfs dfs -getmerge /tmp/bible/ ./bible.result
[hadoop@master hadoop]$ cat bible.result | grep god
Gudgodah 1
Gudgodah; 1
[god] 1
[god]: 1
[gods 1
[gods], 2
ergodiwkthv 1
god 20
god, 19
god. 9
god: 4

god; 3
goddess 4
goddess. 1
godliness 4
godliness) 1
godliness, 4
godliness. 1
godliness: 2
godliness; 3
godly 16
godly, 1
godly-learned 1
gods 93
gods, 88
gods. 28
gods: 15
gods; 8
gods? 7
ungodliness 3
ungodliness. 1
ungodly 17
ungodly, 5
ungodly. 2
ungodly; 2
ungodly? 1
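Note that plain grep matches substrings, which is why Gudgodah, godliness, and ungodly appear above. To keep only the lines where god occurs as a whole word:

[hadoop@master hadoop]$ grep -w god bible.result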

Run it on Master + 2 slave nodes

[hadoop@master sbin]$ ./start-all.sh
[hadoop@master sbin]$ jps
853 SecondaryNameNode
598 DataNode
1009 ResourceManager
466 NameNode
1452 Jps
1120 NodeManager
[hadoop@master sbin]$ hadoop jar /data/hadoop/hadoop-2.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /tmp/bible.txt /tmp/bible2/
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/12 14:01:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/12 14:01:51 INFO client.RMProxy: Connecting to ResourceManager at master.hadoop.advol/130.113.68.234:8032
14/05/12 14:01:52 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 14:01:52 INFO mapreduce.JobSubmitter: number of splits:1
14/05/12 14:01:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1399917682766_0001
14/05/12 14:01:52 INFO impl.YarnClientImpl: Submitted application application_1399917682766_0001
14/05/12 14:01:52 INFO mapreduce.Job: The url to track the job: http://master.hadoop.advol:8088/proxy/application_1399917682766_0001/
14/05/12 14:01:53 INFO mapreduce.Job: Running job: job_1399917682766_0001
14/05/12 14:02:00 INFO mapreduce.Job: Job job_1399917682766_0001 running in uber mode : false

14/05/12 14:02:00 INFO mapreduce.Job: map 0% reduce 0%
14/05/12 14:02:12 INFO mapreduce.Job: map 100% reduce 0%
14/05/12 14:02:21 INFO mapreduce.Job: map 100% reduce 100%
14/05/12 14:02:22 INFO mapreduce.Job: Job job_1399917682766_0001 completed successfully
14/05/12 14:02:22 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=467766
FILE: Number of bytes written=1121223
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=4397316
HDFS: Number of bytes written=343441
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=10832
Total time spent by all reduces in occupied slots (ms)=5080
Total time spent by all map tasks (ms)=10832
Total time spent by all reduce tasks (ms)=5080
Total vcore-seconds taken by all map tasks=10832
Total vcore-seconds taken by all reduce tasks=5080
Total megabyte-seconds taken by all map tasks=11091968
Total megabyte-seconds taken by all reduce tasks=5201920
Map-Reduce Framework

Map input records=33979
Map output records=840245
Map output bytes=7722868
Map output materialized bytes=467766
Input split bytes=110
Combine input records=840245
Combine output records=32635
Reduce input groups=32635
Reduce shuffle bytes=467766
Reduce input records=32635
Reduce output records=32635
Spilled Records=65270
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=760
CPU time spent (ms)=7800
Physical memory (bytes) snapshot=445231104
Virtual memory (bytes) snapshot=1754845184
Total committed heap usage (bytes)=308281344
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=4397206
File Output Format Counters

Bytes Written=343441
[hadoop@master sbin]$ hdfs dfs -getmerge /tmp/bible/ ./bible2.result
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/12 14:07:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@master sbin]$ cat bible2.result | grep god
Gudgodah 1
Gudgodah; 1
[god] 1
[god]: 1
[gods 1
[gods], 2
ergodiwkthv 1
god 20
god, 19
god. 9
god: 4
god; 3
goddess 4
goddess. 1
godliness 4
godliness) 1
godliness, 4
godliness. 1
godliness: 2
godliness; 3
godly 16

godly, 1
godly-learned 1
gods 93
gods, 88
gods. 28
gods: 15
gods; 8
gods? 7
ungodliness 3
ungodliness. 1
ungodly 17
ungodly, 5
ungodly. 2
ungodly; 2
ungodly? 1
[hadoop@master sbin]$

Run word count on my emails on master + 2 slave nodes

Try to run another, bigger file. myemails.txt is an export from my Outlook that includes three years of email.

[hadoop@master hadoop]$ hadoop jar /data/hadoop/hadoop-2.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /tmp/myemails.txt /tmp/myemail/
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.

It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/12 14:54:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/12 14:54:46 INFO client.RMProxy: Connecting to ResourceManager at master.hadoop.advol/130.113.68.234:8032
14/05/12 14:54:46 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 14:54:46 INFO mapreduce.JobSubmitter: number of splits:1
14/05/12 14:54:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1399917682766_0002
14/05/12 14:54:47 INFO impl.YarnClientImpl: Submitted application application_1399917682766_0002
14/05/12 14:54:47 INFO mapreduce.Job: The url to track the job: http://master.hadoop.advol:8088/proxy/application_1399917682766_0002/
14/05/12 14:54:47 INFO mapreduce.Job: Running job: job_1399917682766_0002
14/05/12 14:54:54 INFO mapreduce.Job: Job job_1399917682766_0002 running in uber mode : false
14/05/12 14:54:54 INFO mapreduce.Job: map 0% reduce 0%
14/05/12 14:55:10 INFO mapreduce.Job: map 21% reduce 0%
14/05/12 14:55:13 INFO mapreduce.Job: map 24% reduce 0%
14/05/12 14:55:16 INFO mapreduce.Job: map 43% reduce 0%
14/05/12 14:55:19 INFO mapreduce.Job: map 47% reduce 0%
14/05/12 14:55:22 INFO mapreduce.Job: map 62% reduce 0%
14/05/12 14:55:25 INFO mapreduce.Job: map 67% reduce 0%
14/05/12 14:55:26 INFO mapreduce.Job: map 100% reduce 0%
14/05/12 14:55:33 INFO mapreduce.Job: map 100% reduce 100%
14/05/12 14:55:33 INFO mapreduce.Job: Job job_1399917682766_0002 completed successfully
14/05/12 14:55:33 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=15513937
FILE: Number of bytes written=22690574

FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=86012550
HDFS: Number of bytes written=6281696
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=29563
Total time spent by all reduces in occupied slots (ms)=4477
Total time spent by all map tasks (ms)=29563
Total time spent by all reduce tasks (ms)=4477
Total vcore-seconds taken by all map tasks=29563
Total vcore-seconds taken by all reduce tasks=4477
Total megabyte-seconds taken by all map tasks=30272512
Total megabyte-seconds taken by all reduce tasks=4584448
Map-Reduce Framework
Map input records=3597155
Map output records=10216704
Map output bytes=125299964
Map output materialized bytes=6990938
Input split bytes=113
Combine input records=10489013
Combine output records=456869
Reduce input groups=184560
Reduce shuffle bytes=6990938

Reduce input records=184560
Reduce output records=184560
Spilled Records=641429
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=604
CPU time spent (ms)=32020
Physical memory (bytes) snapshot=457576448
Virtual memory (bytes) snapshot=1773580288
Total committed heap usage (bytes)=304611328
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=86012437
File Output Format Counters
Bytes Written=6281696
[hadoop@master hadoop]$
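To eyeball which words dominate the mailbox, you could merge the output and sort by count, the same way the Bible result was pulled down (the local file names here are hypothetical):

[hadoop@master hadoop]$ hdfs dfs -getmerge /tmp/myemail/ ./myemails.result
[hadoop@master hadoop]$ sort -k2,2nr myemails.result | head -20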

Start history server

The history server is not started by default:

[hadoop@master sbin]$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /data/hadoop/hadoop-2.4.0/logs/mapred-hadoop-historyserver-master.hadoop.advol.out
[hadoop@master sbin]$

Install HBase

Understand what Hadoop, HBase, Hive, and Pig are.

Differences between Hive and HBase:

1. Hive provides a SQL-like language and operates on the HDFS file system through a database abstraction in order to simplify programming; the underlying computation is MapReduce.
2. Hive is a row-oriented database.
3. Hive itself does not store or compute data; it depends entirely on HDFS and MapReduce, and tables in Hive are purely logical.
4. HBase is built for queries: it organizes the memory of all machines in the cluster to provide one huge in-memory hash table.
5. HBase is not a relational database, but a column-oriented distributed database built on HDFS; it does not support SQL.
6. HBase tables are physical, not logical; the huge in-memory hash table lets search engines store indexes in it for convenient lookups.

7. HBase stores data by column.

Pig

Pig is a data-flow language for quickly and easily processing huge amounts of data. Pig has two parts: the Pig interface and Pig Latin. Pig can process HDFS and HBase data very conveniently and, like Hive, very efficiently; working directly with Pig queries can save a lot of labor and time. When you want to do some transformations on your data without writing MapReduce jobs, you can use Pig.

Hive

Hive originated at Facebook. Hive plays the role of a data warehouse in Hadoop: built on top of the Hadoop cluster, it provides a SQL-like interface to the data stored on the cluster. You can use HiveQL for select, join, and so on. If you have data warehouse needs, are good at writing SQL, and do not want to write MapReduce jobs, you can use Hive instead.

HBase

HBase runs as a column-oriented database on top of HDFS. HDFS lacks random read/write access, and HBase exists precisely for that. Modeled on Google BigTable, it stores data as key/value pairs; the project's goal is to quickly locate and access the needed rows among billions of rows on a host. HBase is a database, a NoSQL database, and like other databases it provides random read/write. Hadoop cannot satisfy real-time needs, but HBase can. If you need real-time access to some data, store it in HBase. You can use Hadoop as a static data warehouse and HBase as the data store for data that operations will change.

Pig vs Hive

Hive is more suitable for data warehouse tasks; it is used mainly for static structures and work that needs frequent analysis. Hive's similarity to SQL makes it an ideal intersection point between Hadoop and other BI tools. Pig gives developers more flexibility in the large-data-set domain and allows the development of concise scripts for transforming data flows that can be embedded into larger applications. Pig is relatively lightweight compared to Hive; its main advantage is that it drastically cuts the amount of code compared to using the Hadoop Java APIs directly. Because of this, Pig still attracts a lot of software developers. Both Hive and Pig can be combined with HBase; they provide high-level language support for HBase, which makes statistical processing of data on HBase very simple.

Hive vs HBase

Hive is a batch system built on top of Hadoop to reduce the work of writing MapReduce jobs; HBase is a project that compensates for Hadoop's lack of real-time operations. Imagine you are working with an RDBMS: for a full-table scan, use Hive+Hadoop; for indexed access, use HBase+Hadoop. A Hive query runs as MapReduce jobs and can take anywhere from 5 minutes to many hours; HBase is very efficient, certainly much more efficient than Hive.
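To make the contrast concrete, here is a hypothetical word count in both tools. This is only a sketch: it assumes Hive and Pig are installed, that the 4300.txt file from the earlier test is still in HDFS, and the table, directory, and relation names are made up.

# Hive: declare a table over a directory of text in HDFS, then query with SQL
# (assumes the text file has been copied into the hypothetical /tmp/hivewords directory)
hive -e "CREATE EXTERNAL TABLE lines (line STRING) LOCATION '/tmp/hivewords';
         SELECT word, COUNT(1) FROM lines
         LATERAL VIEW explode(split(line, ' ')) t AS word
         GROUP BY word;"

# Pig: the same job written as a data-flow script
pig -e "lines = LOAD '/tmp/4300.txt' AS (line:chararray);
        words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
        grpd = GROUP words BY word;
        counts = FOREACH grpd GENERATE group, COUNT(words);
        DUMP counts;"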

If you see:

org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

check that the HBase version matches the Hadoop version.

[hadoop@master download]$ wget http://mirror.csclub.uwaterloo.ca/apache/hbase/hbase-0.98.2/hbase-0.98.2-hadoop2-bin.tar.gz
--2014-05-13 18:54:18-- http://mirror.csclub.uwaterloo.ca/apache/hbase/hbase-0.98.2/hbase-0.98.2-hadoop2-bin.tar.gz
Resolving mirror.csclub.uwaterloo.ca... 129.97.134.71
Connecting to mirror.csclub.uwaterloo.ca|129.97.134.71|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 82246622 (78M) [application/x-gzip]
Saving to: 'hbase-0.98.2-hadoop2-bin.tar.gz'

100%[==================================================================>] 82,246,622 20.5M/s in 3.9s

2014-05-13 18:54:22 (20.0 MB/s) - 'hbase-0.98.2-hadoop2-bin.tar.gz' saved [82246622/82246622]

[hadoop@master download]$ tar xfz hbase-0.98.2-hadoop2-bin.tar.gz -C /data/hbase/

Download and modify:

[hadoop@master hbase-0.98.2-hadoop2]$ cd conf
[hadoop@master conf]$ ls
hadoop-metrics2-hbase.properties  hbase-env.sh      hbase-site.xml    regionservers
hbase-env.cmd                     hbase-policy.xml  log4j.properties
[hadoop@master conf]$ more hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master.hadoop.advol:9000/hbase</value>
  </property>
</configuration>

Create a table named card and add a record:

[hadoop@master bin]$ ./hbase shell

hbase(main):019:0> create 'card','mycf'
0 row(s) in 0.7670 seconds

=> Hbase::Table - card

hbase(main):020:0> list 'card'
TABLE
card
1 row(s) in 0.0360 seconds

=> ["card"]

hbase(main):021:0> put 'card','rowid1','mycf:a','sword of Truth'
0 row(s) in 0.1260 seconds

hbase(main):022:0> list 'card'
TABLE
card
1 row(s) in 0.0330 seconds

=> ["card"]

hbase(main):024:0> scan 'card'
ROW                 COLUMN+CELL
 rowid1             column=mycf:a, timestamp=1400090875012, value=sword of Truth
1 row(s) in 0.0400 seconds

hbase(main):025:0>

[hadoop@master bin]$ ./stop-hbase.sh
stopping hbase...
localhost: stopping zookeeper.
[hadoop@master bin]$ jps
8707 NameNode
9248 ResourceManager
9092 SecondaryNameNode
11217 Jps
8837 DataNode
9359 NodeManager
5532 JobHistoryServer
[hadoop@master bin]$ ./start-hbase.sh
localhost: starting zookeeper, logging to /data/hbase/hbase-0.98.2-hadoop2/bin/../logs/hbase-hadoop-zookeeper-master.hadoop.advol.out
starting master, logging to /data/hbase/hbase-0.98.2-hadoop2/bin/../logs/hbase-hadoop-master-master.hadoop.advol.out
localhost: starting regionserver, logging to /data/hbase/hbase-0.98.2-hadoop2/bin/../logs/hbase-hadoop-regionserver-master.hadoop.advol.out
[hadoop@master bin]$ ./hbase shell
2014-05-14 14:10:51,394 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<return>' for list of supported commands.

Type "exit<return>" to leave the HBase Shell Version 0.98.2-hadoop2, r1591526, Wed Apr 30 20:17:33 PDT 2014 hbase(main):001:0> list TABLE SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data/hbase/hbase-0.98.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/staticloggerbinder.class] SLF4J: Found binding in [jar:file:/data/hadoop/hadoop- 2.4.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'. 2014-05-14 14:10:54,583 WARN [main] util.nativecodeloader: Unable to load nativehadoop library for your platform... using builtin-java classes where applicable card 1 row(s) in 1.4430 seconds => ["card"] hbase(main):003:0> scan 'card' ROW rowid1 Truth COLUMN+CELL column=mycf:a, timestamp=1400090875012, value=sword of 1 row(s) in 0.1200 seconds hbase(main):004:0> Also can find a file was created in hdfs