Basic Hadoop Programming Skills

Basic commands of Ubuntu Open file explorer

Basic commands of Ubuntu Open terminal

Basic commands of Ubuntu Open new tabs in the terminal: typically, one tab for compiling source code and one tab for running Hadoop

Basic shell commands (in the terminal)
List directory: hadoop@ubuntu-virtualbox:~$ ls
Create directory: hadoop@ubuntu-virtualbox:~$ mkdir project
Browse into directory: hadoop@ubuntu-virtualbox:~/$ cd project
Download file: hadoop@ubuntu-virtualbox:~/project$ wget http://www.comp.nus.edu.sg/~shilei/download/cs5344-examples.zip

Basic commands of Ubuntu Extract the downloaded zip files
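A minimal sketch of this extraction step, assuming the unzip utility is installed and the archive sits in the project directory created on the previous slide:

hadoop@ubuntu-virtualbox:~/project$ unzip cs5344-examples.zip

This should produce a cs5344-examples directory containing the src (Java sources) and txt (input text files) subdirectories used in the following slides.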

Start/stop Hadoop
Re-format the HDFS (all data will be deleted): hadoop@ubuntu-virtualbox:~$ hadoop namenode -format
Start Hadoop: hadoop@ubuntu-virtualbox:~$ start-all.sh
See if Hadoop is running: hadoop@ubuntu-virtualbox:~$ jps
hadoop@ubuntu-virtualbox:~$ hadoop dfsadmin -report
Stop Hadoop (when you are done): hadoop@ubuntu-virtualbox:~$ stop-all.sh

Hadoop Web Interfaces Browse the following in a web browser:
HDFS status: http://localhost:50070
Hadoop job status: http://localhost:50030

Basic Hadoop Commands HDFS shell commands:
Create/remove folder: hadoop fs -mkdir / -rmr FOLDER_NAME
hadoop@ubuntu-VirtualBox:~$ hadoop fs -mkdir /data
hadoop@ubuntu-VirtualBox:~$ hadoop fs -mkdir /data/input
List folder: hadoop fs -ls PATH
hadoop@ubuntu-VirtualBox:~$ hadoop fs -ls /data

Basic Hadoop Commands HDFS shell commands:
Data transfer: hadoop fs -cp / -mv / -put / -get src dest
hadoop@ubuntu-VirtualBox:~$ hadoop fs -put project/cs5344-examples/txt/* /data/input

Compile source code
Compile Hadoop code: javac -classpath `hadoop classpath` -d destination_dir source_dir/filename.java
(the quotation marks around hadoop classpath are backticks, the key above Tab)
Generate a jar file: jar -cvf WordCount.jar -C destination_dir .
(note the trailing dot at the end of the command)

Compile WordCount example
Browse to the source code directory: hadoop@ubuntu-virtualbox:~/$ cd project/cs5344-examples/src
Create a directory to store the compiled classes: hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ mkdir classes
Compile the WordCount code: hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ javac -classpath `hadoop classpath` -d classes ./WordCount.java
(the quotation marks around hadoop classpath are backticks, the key above Tab)
Generate a jar file: hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ jar -cvf WordCount.jar -C classes/ .
(note the trailing dot at the end of the command)

Basic Hadoop Commands Launch job commands: hadoop jar PATH_TO_JAR_FILE classname parameters
E.g., launch the WordCount job compiled above:
hadoop@ubuntu-VirtualBox:~/project/cs5344-examples/src$ hadoop jar WordCount.jar myhadoop.WordCount /data/input /data/output
Display the job results:
hadoop@ubuntu-VirtualBox:~$ hadoop fs -ls /data/output
hadoop@ubuntu-VirtualBox:~$ hadoop fs -cat /data/output/part-r-00000
Note: if you run a job multiple times, you need to delete the output folder each time before you launch the job:
hadoop@ubuntu-VirtualBox:~$ hadoop fs -rmr /data/output

Customize the number of reducers
Edit WordCount.java in the text editor: job.setNumReduceTasks(1) → job.setNumReduceTasks(4)
Compile, generate the jar, launch the job again, and display the results (there will be 4 output files); an example command sequence is sketched below.
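A minimal sketch of this recompile-and-rerun cycle, reusing the paths and commands from the earlier slides (remember that the output folder from the previous run must be deleted first):

hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ javac -classpath `hadoop classpath` -d classes ./WordCount.java
hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ jar -cvf WordCount.jar -C classes/ .
hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ hadoop fs -rmr /data/output
hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ hadoop jar WordCount.jar myhadoop.WordCount /data/input /data/output
hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ hadoop fs -ls /data/output

With four reducers, the output folder should contain four files, part-r-00000 through part-r-00003.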

Adding a combiner to the map phase
The combiner is already included in the WordCount.java example: job.setCombinerClass(IntSumReducer.class);
This combiner uses the same class as the reducer, because sum is an associative aggregation, so the job result is the same as without a combiner. We cannot tell the difference in performance (running time) here because the input data / shuffled data are not large enough.

Map-only jobs
Edit WordCount.java in the text editor: job.setNumReduceTasks(1) → job.setNumReduceTasks(0)
Compile, generate the jar, launch the job again, and display the results (there will be 3 output files corresponding to the 3 map tasks run on the 3 input files, i.e., the results are not aggregated by any reducer)
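To inspect the map-only output (assuming the same /data/output path as before): with zero reducers the map output is written out directly, so the part files are typically named part-m-00000, part-m-00001, and so on instead of part-r-*:

hadoop@ubuntu-virtualbox:~$ hadoop fs -ls /data/output
hadoop@ubuntu-virtualbox:~$ hadoop fs -cat /data/output/part-m-00000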

Work with different input format
Write 2 MR jobs:
WordCountSO is the word count, but it writes its results in SequenceFile output format (see the sample code in WordCountSO.java: job.setOutputFormatClass(SequenceFileOutputFormat.class);)
WordCountSort takes the above output (a SequenceFile) as input and sorts the words by frequency (see the sample code in WordCountSort.java: job.setInputFormatClass(SequenceFileInputFormat.class);)

Work with different input format
Compile WordCountSO.java and WordCountSort.java, and generate the WordCount.jar file
Execute the 2 MR jobs sequentially:
hadoop jar WordCount.jar myhadoop.WordCountSO /data/input /data/output
hadoop jar WordCount.jar myhadoop.WordCountSort /data/output/part-r-00000 /data/sortoutput
Display the final results:
hadoop fs -cat /data/sortoutput/part-r-00000