Hadoop Tutorial
Group 7 - Tools For Big Data
Indian Institute of Technology Bombay




Dipojjwal Ray
Sandeep Prasad

1 Introduction

In the installation manual we listed the installation steps for hadoop-1.0.3 and hadoop-1.0.4. In this report we present various examples run on Hadoop. Once installation is complete, any of the examples mentioned below can be run as a check that the installation is working properly. The examples explained in this report are:

1. wordcount: listing the words that occur in a given file along with their occurrence frequencies [1]
2. pi: estimating the value of pi [2]
3. pagerank
4. inverted indexing
5. indexing wikipedia: in this example we index the entire English Wikipedia

2 Wordcount

The wordcount example counts and sorts the words in a given single file or group of files. Files of various sizes were used for this example. The 1st set of experiments was conducted on single files and the 2nd set on groups of files. The 1st set used 5 files, whose details, along with the time required to run wordcount on each, are given in Table 1. The 2nd set used combinations of files from the 1st set, whose details can be found in Table 2.

The figures given below are for line 3 of Table 2, with 3 files in a gutenberg directory under /tmp. Figure 1 shows the commands of Listing 1 executed on my machine. It is assumed that the files are located in the /tmp directory under an appropriate name (in my case the directory is /tmp/gutenberg).

1 $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
2 $ bin/hadoop dfs -ls /user/hduser/gutenberg

Listing 1: Copying files from the user machine to Hadoop's file system

1st set of experiments

file name     size      CPU time required (ms)
pg20417.txt   674.6 KB  3380
pg2243.txt    137.3 KB  2270
pg28885.txt   177.4 KB  2520
pg4300.txt    1.6 MB    4090
pg5000.txt    1.4 MB    3700

Table 1: Time required to count words in single files

2nd set of experiments

file names                                                    total size  CPU time required (ms)
pg4300.txt, pg5000.txt                                        3.0 MB      6860
pg4300.txt, pg5000.txt, pg20417.txt                           3.7 MB      9580
pg2243.txt, pg5000.txt, pg20417.txt, pg28885.txt              2.4 MB      9090
pg2243.txt, pg4300.txt, pg5000.txt, pg20417.txt, pg28885.txt  4.0 MB      11410

Table 2: Time required to count words in multiple files

Line 1 of Listing 1 copies the files from /tmp/gutenberg on the local machine to the directory /user/hduser/gutenberg in Hadoop's file system. Line 2 of Listing 1 lists the files just copied to /user/hduser/gutenberg, as a check.

Figure 1: copy files to dfs

The command to run wordcount is given in Listing 2, and the command as executed on my machine is shown in Listing 3. The files from /user/hduser/gutenberg are used as input and the output is stored in /user/hduser/gutenberg-output.

$ bin/hadoop jar hadoop-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Listing 2: Command to run wordcount on the copied files

hduser@ada-desktop:/usr/local/hadoop$ bin/hadoop jar hadoop-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Warning: $HADOOP_HOME is deprecated.

13/07/29 14:20:57 INFO input.FileInputFormat: Total input paths to process : 3
13/07/29 14:20:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/29 14:20:57 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/29 14:20:57 INFO mapred.JobClient: Running job: job_201307291349_0001
13/07/29 14:20:58 INFO mapred.JobClient:  map 0% reduce 0%
13/07/29 14:21:13 INFO mapred.JobClient:  map 66% reduce 0%
13/07/29 14:21:19 INFO mapred.JobClient:  map 100% reduce 0%
13/07/29 14:21:22 INFO mapred.JobClient:  map 100% reduce 22%
13/07/29 14:21:31 INFO mapred.JobClient:  map 100% reduce 100%
13/07/29 14:21:36 INFO mapred.JobClient: Job complete: job_201307291349_0001
13/07/29 14:21:36 INFO mapred.JobClient: Counters: 29
13/07/29 14:21:36 INFO mapred.JobClient:   Job Counters
13/07/29 14:21:36 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20523
13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/29 14:21:36 INFO mapred.JobClient:     Launched map tasks=3
13/07/29 14:21:36 INFO mapred.JobClient:     Data-local map tasks=3
13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16245
13/07/29 14:21:36 INFO mapred.JobClient:   File Output Format Counters
13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Written=880838
13/07/29 14:21:36 INFO mapred.JobClient:   FileSystemCounters
13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_READ=2214875
13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_READ=3671884
13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3775583
13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880838
13/07/29 14:21:36 INFO mapred.JobClient:   File Input Format Counters
13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Read=3671523
13/07/29 14:21:36 INFO mapred.JobClient:   Map-Reduce Framework
13/07/29 14:21:36 INFO mapred.JobClient:     Map output materialized bytes=1474367
13/07/29 14:21:36 INFO mapred.JobClient:     Map input records=77931
13/07/29 14:21:36 INFO mapred.JobClient:     Reduce shuffle bytes=1207341
13/07/29 14:21:36 INFO mapred.JobClient:     Spilled Records=255966
13/07/29 14:21:36 INFO mapred.JobClient:     Map output bytes=6076101
13/07/29 14:21:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=586285056
13/07/29 14:21:36 INFO mapred.JobClient:     CPU time spent (ms)=9580
13/07/29 14:21:36 INFO mapred.JobClient:     Combine input records=629172
13/07/29 14:21:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=361
13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input records=102324
13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input groups=82335
13/07/29 14:21:36 INFO mapred.JobClient:     Combine output records=102324
13/07/29 14:21:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=625811456
13/07/29 14:21:36 INFO mapred.JobClient:     Reduce output records=82335
13/07/29 14:21:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1897635840
13/07/29 14:21:36 INFO mapred.JobClient:     Map output records=629172
hduser@ada-desktop:/usr/local/hadoop$

Listing 3: wordcount executed on /user/hduser/gutenberg

In case the system is not able to open the jar file, the following error message is received:

Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-examples.jar
        at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file

In such cases use the complete name of the jar file (for example, hadoop-examples-1.0.3.jar instead of hadoop*examples*.jar) and run the command again.
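The map/reduce logic that the wordcount job applies can be sketched in plain Python. This is a minimal single-process sketch of the algorithm only, not the Java source that ships in hadoop-examples.jar:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every whitespace-separated
    # token in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    # Hadoop's reducer output is sorted by key, so sort here too.
    return dict(sorted(counts.items()))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(lines)))
# → {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In the real job the map phase runs in parallel over input splits (3 map tasks in Listing 3), and the framework performs the grouping and sorting between the map and reduce phases.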

As mentioned, the output is stored in /user/hduser/gutenberg-output. To check that the files exist, run the command from line 2 of Listing 1 with gutenberg replaced by gutenberg-output. Figure 2 shows the files present on my system.

Figure 2: checking the files produced by wordcount

Figure 3 shows the retrieved output, which can be checked by copying the results back to the local system. Note the -getmerge option in line 2 of Listing 4: it merges everything present in the gutenberg-output folder into a single local file.

1 $ mkdir /tmp/gutenberg-output
2 $ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
3 $ head /tmp/gutenberg-output/gutenberg-output

Listing 4: Checking wordcount results after copying them to the local system

Figure 3: Checking wordcount results

The results can also be read without copying them to the local system, using the command given in Listing 5.

$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

Listing 5: Checking wordcount results without copying them to the local system
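What -getmerge does can be mimicked locally: it concatenates every file under the given directory (the part-r-* output files) into one destination file. A small Python sketch of that behavior, using an ordinary local directory as a stand-in for the HDFS path:

```python
import os

def getmerge(src_dir, dst_file):
    # Concatenate every regular file in src_dir into dst_file, in
    # sorted name order, so part-r-00000 precedes part-r-00001, etc.
    with open(dst_file, "w") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                with open(path) as f:
                    out.write(f.read())
```

With a single reducer, as in the wordcount run above, there is only one part-r-00000 file, so the merge simply copies it.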

3 Value of PI

Hadoop can be used to estimate the value of pi, which is approximately 3.14159 (footnote 1). In this example the value of pi is estimated using a quasi-Monte Carlo method. The estimate is obtained with the command in Listing 6. Two values are given after pi: the first, x, is the number of map tasks, and the second, y, is the number of samples per map. The results of some of the experiments conducted are given in Table 3.

$ bin/hadoop jar hadoop-examples.jar pi 10 100

Listing 6: Command to estimate the value of pi

x   y        Time required (secs)  Value calculated
10  100      60.53                 3.148
10  200      53.53                 3.144
10  400      55.58                 3.14
10  1000000  54.45                 3.1415844
50  100      178.82                3.1418

Table 3: Time required to estimate the value of pi for different x and y

Footnote 1: http://en.wikipedia.org/wiki/pi

References

[1] Michael G. Noll. Running Hadoop on Ubuntu Linux (single-node cluster). http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/.

[2] Cloud9: A MapReduce library for Hadoop - Getting started in standalone mode. http://lintool.github.io/cloud9/docs/content/start-standalone.html.
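As a supplement to Section 3: the quasi-Monte Carlo idea behind the pi example can be sketched in Python. Points in the unit square are drawn from a low-discrepancy sequence, and pi is estimated as 4 times the fraction that falls inside the inscribed quarter circle. This sketch uses a 2-D Halton sequence (bases 2 and 3), one standard quasi-random choice; the Java implementation in hadoop-examples.jar may differ in detail, and it additionally splits the samples across x map tasks:

```python
def halton(index, base):
    # Radical inverse of index in the given base, a value in [0, 1).
    result, f = 0.0, 1.0 / base
    i = index
    while i > 0:
        result += f * (i % base)
        i //= base
        f /= base
    return result

def estimate_pi(samples):
    # Count quasi-random points (x, y) falling inside the quarter
    # circle x^2 + y^2 <= 1; its area is pi/4 of the unit square.
    inside = 0
    for i in range(1, samples + 1):
        x, y = halton(i, 2), halton(i, 3)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

# Deterministic, unlike plain Monte Carlo: the same sample count
# always yields the same estimate, typically close to 3.14159.
print(estimate_pi(100_000))
```

Because the sequence is deterministic, repeated runs with the same y reproduce the same value, which matches the behavior visible in Table 3 where larger y gives a closer estimate.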