Hadoop for Java Developers [HOL1813]

by Christopher M. Judd (javajudd@gmail.com)

Contents

Lab 1 - Run First Hadoop Job
Lab 2 - Run Hadoop in a Pseudo-Distributed Mode
Lab 3 - Utilize HDFS
Lab 4 - Combine Hadoop & HDFS
Lab 5 - Write First Hadoop Map/Reduce Job
Appendix A - Environment Setup
    Tutorial Dependencies
    Add Hadoop User and Group
    Install Hadoop
    Configure YARN
    Configure HDFS
    Git project templates

This hands-on lab assumes you are logged into a virtual machine that has completed the Appendix A - Environment Setup, as the user hduser/hduser.

Lab 1 - Run First Hadoop Job

This lab runs the pi example provided with Hadoop. The example estimates pi using the quasi-Monte Carlo method, which does not give you an exact value of pi but instead uses sets of randomly generated sample points and an equation to approximate pi. Therefore, when you run this job, you specify how many sample sets to run and how many random points you want the job to generate in each set.
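To make the idea concrete, here is a minimal standalone Java sketch of the same estimation (not part of the lab, and simplified: the Hadoop example spreads the sampling across map tasks and uses a quasi-random Halton sequence rather than a pseudo-random generator). Points are scattered over the unit square, and the fraction that lands inside the inscribed quarter circle approximates pi/4.

   import java.util.Random;

   // Minimal single-machine sketch of Monte Carlo pi estimation.
   public class PiEstimate {
       public static void main(String[] args) {
           long samples = 1_000_000;
           long inside = 0;
           Random random = new Random();
           for (long i = 0; i < samples; i++) {
               // Random point in the unit square [0,1) x [0,1).
               double x = random.nextDouble();
               double y = random.nextDouble();
               // Count points falling inside the quarter circle of radius 1.
               if (x * x + y * y <= 1.0) {
                   inside++;
               }
           }
           // (area of quarter circle) / (area of square) = pi/4.
           System.out.println("Estimated pi = " + 4.0 * inside / samples);
       }
   }

More sample sets and more points per set generally tighten the approximation, which is exactly what the two parameters to the Hadoop job control.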

The goals of this lab are to:

- ensure Hadoop is installed properly
- learn how to run a Hadoop job in local standalone mode
- discover what examples are included with Hadoop

1. Run the pi example with 4 sets of data containing 1000 random samples each.

   $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000

2. What is the estimated value of pi?

3. To see the other examples included with Hadoop, run the same examples jar with no parameters.

   $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar

4. What other examples are there?

5. To see the parameter options for an example, run the examples jar with the name of the example.

   $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi

Lab 2 - Run Hadoop in a Pseudo-Distributed Mode

Hadoop's strength is that it can run in a distributed manner across commodity hardware. This lab simulates running Hadoop in a distributed manner.

The goals of this lab are to:

- be able to start and stop YARN
- run a Hadoop job in a distributed manner
- view node activity

1. Edit the MapReduce config file and remove the commented-out XML.

   $ sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml

   <configuration>
     <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
     </property>
   </configuration>

2. Start YARN.

   $ start-yarn.sh

3. Verify YARN is running and that you see the NodeManager and ResourceManager processes.

   $ jps

4. Run the same pi example as earlier.

   $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000

5. View the applications/jobs and their containers while the job is running. Open a browser and navigate to http://localhost:8042/node.

6. Click on List of Applications in the right menu.

7. Select the running application.

8. Refresh the browser, watching the containers come and go with the demand.

Lab 3 - Utilize HDFS

This lab uses Hadoop's distributed file system, HDFS, to store and distribute files.

The goals of this lab are to:

- be able to start and stop HDFS
- create directories in HDFS
- copy files to HDFS
- use basic HDFS commands
- use the web application to view the filesystem and check HDFS's health

NOTE: If you receive the warning "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable", it is just a warning and can be ignored. It occurs because Hadoop was compiled for 32-bit but we are running on a 64-bit system.

1. Configure the Hadoop core to know about HDFS by removing the commented-out XML.

   $ sudo vim $HADOOP_HOME/etc/hadoop/core-site.xml

   <configuration>
     <property>
       <name>fs.default.name</name>
       <value>hdfs://localhost:9000</value>
     </property>
   </configuration>

2. Start HDFS.

   $ start-dfs.sh

3. Verify HDFS is running and that you see the NameNode, DataNode and SecondaryNameNode processes in addition to the YARN processes.

   $ jps

4. List the contents of the HDFS root.

   $ hdfs dfs -ls /

5. Make a directory called books.

   $ hdfs dfs -mkdir /books

6. List the contents of the HDFS root again to make sure you see the books directory.

   $ hdfs dfs -ls /

7. List the contents of the books directory.

   $ hdfs dfs -ls /books

8. Copy the Moby Dick book to the books directory on HDFS.

   $ hdfs dfs -copyFromLocal /opt/data/moby_dick.txt /books

9. Display the contents of Moby Dick from HDFS.

   $ hdfs dfs -cat /books/moby_dick.txt

10. Investigate some other HDFS commands:

    appendToFile, cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, count, cp, du, get, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, stat, tail, test, text, touchz
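As a quick illustration (these are suggestions, not part of the original lab steps), a few of those commands applied to the file already copied to /books; the local /tmp destination below is just an arbitrary example path:

   # Show the size of everything under /books in human-readable units.
   $ hdfs dfs -du -h /books

   # Show the last kilobyte of the file, handy for spot-checking large files.
   $ hdfs dfs -tail /books/moby_dick.txt

   # Copy the file back out of HDFS to the local filesystem.
   $ hdfs dfs -get /books/moby_dick.txt /tmp/moby_dick.txt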

11. View the health of HDFS by opening a browser and navigating to http://localhost:50070/dfshealth.jsp.

12. Browse the filesystem and find Moby Dick.

Lab 4 - Combine Hadoop & HDFS

This lab combines the distributed processing of Hadoop with the distributed filesystem of HDFS to create a distributed solution.

The goals of this lab are to:

- run a Hadoop job that uses data stored on HDFS
- view the results of a Hadoop job

1. Run the Hadoop wordcount example using the Moby Dick book as input and outputting the results to a new out directory.

   $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /books out

2. Determine the files produced by the Hadoop job.

   $ hdfs dfs -ls out

3. View the contents of the _SUCCESS file.

   $ hdfs dfs -cat out/_SUCCESS

4. What is the purpose of the _SUCCESS file?

5. View the contents of the results file.

   $ hdfs dfs -cat out/part-r-00000

6. How many times was zeal found in Moby Dick? (One way to check is sketched below the step list.)

7. Run the job again.

   $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /books out

8. What was the output? Why was that the output?
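Rather than scanning part-r-00000 by eye for step 6, you can filter it; this pipeline is just one possible approach and is not part of the original lab:

   # The wordcount output is tab-separated "word<TAB>count" lines,
   # so match only lines whose first word is exactly "zeal".
   $ hdfs dfs -cat out/part-r-00000 | grep -w '^zeal'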

Lab 5 - Write First Hadoop Map/Reduce Job

This lab includes writing your first mapper, reducer and job class and then running them.

The goals of this lab are to:

- write a custom mapper
- write a custom reducer
- write a custom job

1. Change directory to the project workspace containing a project template.

   $ cd ~/workspaces/hadoop-wordcount

2. Create the Mapper class.

   $ mkdir -p src/main/java/com/manifestcorp/hadoop/wc
   $ sudo vim src/main/java/com/manifestcorp/hadoop/wc/WordCountMapper.java

   package com.manifestcorp.hadoop.wc;

   import java.io.IOException;

   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;

   public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

       private static final String SPACE = " ";
       private static final IntWritable ONE = new IntWritable(1);
       private Text word = new Text();

       public void map(Object key, Text value, Context context)
               throws IOException, InterruptedException {
           // Split the line on spaces and emit (word, 1) for each token.
           String[] words = value.toString().split(SPACE);
           for (String str : words) {
               word.set(str);
               context.write(word, ONE);
           }
       }
   }

3. Create the Reducer class.

   $ sudo vim src/main/java/com/manifestcorp/hadoop/wc/WordCountReducer.java

   package com.manifestcorp.hadoop.wc;

   import java.io.IOException;

   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Reducer;

   public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

       public void reduce(Text key, Iterable<IntWritable> values, Context context)
               throws IOException, InterruptedException {
           // Each value is a 1 emitted by the mapper, so counting the values
           // gives the total number of occurrences of this word.
           int total = 0;
           for (IntWritable value : values) {
               total++;
           }
           context.write(key, new IntWritable(total));
       }
   }

4. Create the Job class.

   $ sudo vim src/main/java/com/manifestcorp/hadoop/wc/MyWordCount.java

   package com.manifestcorp.hadoop.wc;

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class MyWordCount {

       public static void main(String[] args) throws Exception {
           Job job = new Job();
           job.setJobName("my word count");
           job.setJarByClass(MyWordCount.class);

           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));

           job.setMapperClass(WordCountMapper.class);
           job.setReducerClass(WordCountReducer.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);

           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }

5. Build the project.

   $ mvn clean package

6. Run the job.

   $ hadoop jar target/hadoop-mywordcount-0.0.1-SNAPSHOT.jar com.manifestcorp.hadoop.wc.MyWordCount /books out-2

7. Determine the files produced by the Hadoop job.

   $ hdfs dfs -ls out-2

8. View the contents of the _SUCCESS file.

   $ hdfs dfs -cat out-2/_SUCCESS

9. What is the purpose of the _SUCCESS file?

10. View the contents of the results file.

    $ hdfs dfs -cat out-2/part-r-00000

11. How many times was zeal found in Moby Dick?

Appendix A - Environment Setup

This hands-on lab assumes you have a virtual machine running Ubuntu 12.04 LTS and that you have logged in with a user that has sudo privileges.

Tutorial Dependencies

1. Make a directory to store the dependencies.

   $ sudo mkdir /opt/data

2. Change directory to the newly created directory.

   $ cd /opt/data

3. Download the Hadoop binary.

   $ sudo wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz

4. Optionally, download the Hadoop source code.

   $ sudo wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz

5. Download a copy of Moby Dick and rename it.

   $ sudo wget http://www.gutenberg.org/ebooks/2489.txt.utf-8
   $ sudo mv 2489.txt.utf-8 moby_dick.txt

Add Hadoop User and Group

1. Add the hadoop group.

   $ sudo addgroup hadoop

2. Add hduser and put them in the hadoop group.

   $ sudo adduser --ingroup hadoop hduser

3. Give sudo rights to hduser.

   $ sudo adduser hduser sudo

4. Change user to hduser.

   $ su hduser

5. Change directory to hduser's home directory.

   $ cd ~

Install Hadoop

1. Make a root directory for the Hadoop install.

   $ sudo mkdir -p /opt/hadoop

2. Uncompress the Hadoop tar file.

   $ sudo tar vxzf /opt/data/hadoop-2.2.0.tar.gz -C /opt/hadoop

3. Make hduser the owner of the Hadoop home directory, recursively.

   $ sudo chown -R hduser:hadoop /opt/hadoop/hadoop-2.2.0

4. Using a text editor, add the following environment variables to the end of hduser's .bashrc file.

   # other stuff

   # java variables
   export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

   # hadoop variables
   export HADOOP_HOME=/opt/hadoop/hadoop-2.2.0
   export PATH=$PATH:$HADOOP_HOME/bin
   export PATH=$PATH:$HADOOP_HOME/sbin
   export HADOOP_MAPRED_HOME=$HADOOP_HOME
   export HADOOP_COMMON_HOME=$HADOOP_HOME
   export HADOOP_HDFS_HOME=$HADOOP_HOME
   export YARN_HOME=$HADOOP_HOME

5. Apply the environment variables.

   $ source .bashrc

6. Make sure Hadoop is installed and available.

   $ hadoop version

Configure YARN

1. From the /opt/hadoop/hadoop-2.2.0 directory, edit the YARN config file.

   $ sudo vim etc/hadoop/yarn-site.xml

   <configuration>
     <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
     </property>
     <property>
       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
     </property>
   </configuration>

2. Copy and edit the MapReduce config file.

   $ sudo mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
   $ sudo vim etc/hadoop/mapred-site.xml

   <configuration>
     <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
     </property>
   </configuration>

3. As hduser, configure a passwordless login.

   $ ssh-keygen -t rsa -P ""
   $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

4. Validate that the passwordless login works.

   $ ssh localhost
   $ exit

5. Configure the JAVA_HOME environment variable.

   $ vim etc/hadoop/hadoop-env.sh

   # other stuff
   export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
   # more stuff

Configure HDFS

1. Make a directory for the Name Node.

   $ sudo mkdir -p /opt/hdfs/namenode

2. Make a directory for the Data Node.

   $ sudo mkdir -p /opt/hdfs/datanode

3. Update permissions on the previous directories to make them writable.

   $ sudo chmod -R 777 /opt/hdfs
   $ sudo chown -R hduser:hadoop /opt/hdfs

4. Update the HDFS config file.

   $ cd /opt/hadoop/hadoop-2.2.0
   $ sudo vim etc/hadoop/hdfs-site.xml

   <?xml version="1.0" encoding="UTF-8"?>
   <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
   <configuration>
     <!-- number of hdfs instances to replicate to -->
     <property>
       <name>dfs.replication</name>
       <value>1</value>
     </property>
     <!-- location of Name Node data -->
     <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:/opt/hdfs/namenode</value>
     </property>
     <!-- location of Data Node data -->
     <property>
       <name>dfs.datanode.data.dir</name>
       <value>file:/opt/hdfs/datanode</value>
     </property>
   </configuration>

5. Format HDFS.

   $ hdfs namenode -format

6. Configure the Hadoop core to know about HDFS.

   $ sudo vim etc/hadoop/core-site.xml

   <configuration>
     <property>
       <name>fs.default.name</name>
       <value>hdfs://localhost:9000</value>
     </property>
   </configuration>

Git project templates

1. Change directory to the workspaces directory.

   $ cd ~/workspaces

2. Clone the hadoop-logs project.

   $ git clone https://github.com/cjudd/hadoop-logs.git

3. Clone the hadoop-wordcount project.

   $ git clone https://github.com/cjudd/hadoop-wordcount.git