
Comparative analysis of MapReduce jobs by keeping data constant and varying cluster size

Mahesh Maurya a, Sunita Mahajan b*
a Research Scholar, JJT University, MPSTME, Mumbai, India, maheshkmaurya@yahoo.co.in
b* Principal, MET, Mumbai, India, sunitam_ics@met.edu
* Corresponding author. sunitam_ics@met.edu

Abstract

This paper discusses the LinkCount MapReduce program, which counts the number of links available in a file. The input file is taken from Wikipedia and contains links enclosed in double square brackets. We report experimental results for this application on a Hadoop cluster, measuring execution time against the number of nodes. We find that as the number of nodes increases, the execution time also increases; however, this holds only when the data size is small relative to the cluster. The paper is a research study of the LinkCount and WordCount MapReduce applications.

Keywords: Hadoop, MapReduce, HDFS, LinkCount, WordCount

1. Introduction

Apache Hadoop is a software framework that supports data-intensive distributed applications [1]. It enables applications to work with thousands of independent computers and petabytes of data. Hadoop was derived from Google's MapReduce [2] and Google File System (GFS) [3] [8] papers.

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single DataNode; a cluster of DataNodes forms the HDFS cluster. The situation is only typical because a node can operate without a DataNode being present. Each DataNode serves blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication; clients use RPC to communicate with each other. HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts [1].

MapReduce is a programming model for processing large data sets, and the name of Google's implementation of that model. MapReduce is typically used for distributed computing on clusters of computers [1] [8].

The remainder of this paper is organized as follows. Section 2 covers the Hadoop file system: what it is, its architecture, and its communication. Section 3 discusses MapReduce: the model, its architecture, and its communication. Sections 4 and 5 present the experimental setup, results, and observations. Sections 6 and 7 are the conclusion and references.

2. Hadoop file system (HDFS) [5] [8] [10] [12]

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks [5]. Hadoop [4] [6] [7] provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce [2] paradigm.
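To make the client interaction concrete, the following minimal sketch (not code from the paper; it assumes the standard org.apache.hadoop.fs.FileSystem API, and the NameNode URI and paths are illustrative) writes a small file to HDFS and reads it back. The NameNode resolves paths and block locations, while the data itself streams to and from DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS points at the NameNode; this URI is an example value.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/sample.txt");

    // Write: the client asks the NameNode for target DataNodes, then
    // streams the bytes; HDFS replicates each block across hosts.
    try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode,
    // block contents come directly from the DataNodes.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}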

2.1. Architecture [5] [11]

A. NameNode: The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes such as permissions, modification and access times, and namespace and disk space quotas.

B. DataNodes: Each block replica on a DataNode is represented by two files in the local host's native file system. The NameNode does not directly call DataNodes; it uses replies to heartbeats to send instructions to them. The instructions include commands to: 1. replicate blocks to other nodes; 2. remove local block replicas; 3. re-register or shut down the node; 4. send an immediate block report.

C. HDFS Client: User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface. Like most conventional file systems, HDFS supports operations to read, write, and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace.

D. Image and Journal: The namespace image is the file system metadata that describes the organization of application data as directories and files. A persistent record of the image written to disk is called a checkpoint [5].

E. CheckpointNode: The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different host from the NameNode, since it has the same memory requirements as the NameNode.

F. BackupNode: The BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.

G. Upgrades and file system snapshots: The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades. The snapshot mechanism lets administrators persistently save the current state of the file system, so that if an upgrade results in data loss or corruption it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.

3. MapReduce programming [2] [8] [13]

MapReduce [2] [8] is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

3.1. Programming model [2] [8] [9]

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values.

3.2. Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents.
The user would write code similar to the following pseudo-code [2] [8]:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.

3.3. Types

Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types [2] [8]:

map    (k1, v1)        → list(k2, v2)
reduce (k2, list(v2))  → list(v2)
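For reference, the same WordCount logic expressed against the Hadoop Java MapReduce API is sketched below. This is a minimal illustration, not the exact code used in the experiments; it assumes the org.apache.hadoop.mapreduce classes with a Hadoop 2.x-style driver, and the type parameters of Mapper and Reducer mirror the (k1, v1) → list(k2, v2) signatures above:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a class like this is what the bin/hadoop jar command shown in Section 4 invokes.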

4. Experimental setup

The experiments were carried out on a Hadoop cluster. The Hadoop infrastructure consists of one cluster of six nodes located in a single lab. For our series of experiments we used the nodes of this cluster, each with an Intel Core 2 processor (4 CPUs) and 4 GB of RAM, a measured bandwidth for end-to-end TCP sockets of 100 MB/s, Ubuntu 11.10, and Sun Java JDK 1.6.0.

Experiments with MapReduce applications:

WordCount reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. To run the example, the command syntax is:

bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>

The input is first copied into HDFS:

bin/hadoop dfs -mkdir <hdfs-dir>
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

As of version 0.17.2.1, you only need to run a command like this:

bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

Fig. 1. Overview of MapReduce execution [2] [8]

LinkCount reads text files and counts how often links occur. The input is text files and the output is text files, each line of which contains a link and the count of how often it occurred, separated by a tab.
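The paper does not reproduce the LinkCount source, so the mapper below is a plausible sketch under the assumption, stated in the abstract, that links appear in double square brackets Wikipedia-style. The class name LinkCountMapper and the regular expression are illustrative, and the summing reducer from the WordCount sketch can be reused unchanged:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (link, 1) for every [[...]] occurrence,
// matching the paper's description of Wikipedia-style links.
public class LinkCountMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  // Assumed link syntax: double square brackets, e.g. [[Apache Hadoop]].
  private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]]+)\\]\\]");
  private final static IntWritable ONE = new IntWritable(1);
  private final Text link = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher m = LINK.matcher(value.toString());
    while (m.find()) {
      link.set(m.group(1)); // the link target inside the brackets
      context.write(link, ONE);
    }
  }
}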

5. Results and observations

5.1. Nodes participated

Table 1. Master and slave nodes in the experiment.

Master Node | Slave 1 | Slave 2 | Slave 3 | Slave 4 | Slave 5

5.2. LinkCount for data size 140 MB

Table 2. Observations of the LinkCount MapReduce application.

No. of Nodes | Node    | CPU time spent, runs 1-5 (ms)     | Average (ms)
1            | Master  | 26290, 26130, 26070, 26440, 26150 | 26216
2            | Slave 1 | 26920, 27710, 26970, 26340, 26950 | 26978
3            | Slave 2 | 27020, 27270, 27640, 29050, 27920 | 27780
4            | Slave 4 | 27350, 27920, 26600, 27990, 27820 | 27536
5            | Slave 5 | 28420, 28280, 27910, 27790, 28570 | 28194
6            | Slave 6 | 27820, 27600, 28950, 28890, 28140 | 28280

Fig. 2. Results of the LinkCount MapReduce application

5.3. Analysis

1. If the number of nodes is large and the data size is small, the coordination overhead dominates.
2. As the graph shows, one node produces the result in less time than six nodes.

5.4. WordCount for data size 140 MB

Table 3. Observations of the WordCount MapReduce application.

Sr. No. | No. of Nodes | Time spent in execution, runs 1-5 (ms) | Average time (ms)
1       | 2            | 4910, 7470, 6690, 6490, 6560           | 6424
2       | 3            | 6760, 6400, 4880, 5760, 6050           | 5970
3       | 4            | 6340, 4750, 6180, 6080, 6030           | 5876

Fig. 3. Results of the WordCount MapReduce application

5.5. Analysis

1. In the above graph we can see that with four nodes the execution time is reduced.
2. If the number of nodes fits the data size, i.e. if four nodes are sufficient for the MapReduce job to execute the complete program, there is no need to grow the cluster further.

6. Conclusion

In this paper we have presented the observations, results, and analysis of the LinkCount and WordCount applications. The analysis shows that as the number of nodes in the cluster increases, the time spent in execution increases if the data size is small, and vice versa. MapReduce application performance thus depends on both the size of the Hadoop cluster and the data size. The WordCount experiments used up to four nodes, i.e. one master and three slaves.

References

[1] Apache Hadoop. http://en.wikipedia.org/wiki/apache_hadoop
[2] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI '04: 6th Symposium on Operating Systems Design and Implementation, USENIX Association, 2004.
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," SOSP '03, October 19-22, 2003, Bolton Landing, New York, USA. ACM 1-58113-757-5/03/0010, 2003.
[4] Apache Hadoop. http://hadoop.apache.org
[5] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The Hadoop Distributed File System," Yahoo!, Sunnyvale, California, USA. 978-1-4244-7153-9/10 IEEE, 2010.
[6] J. Venner, Pro Hadoop. Apress, June 22, 2009.
[7] T. White, Hadoop: The Definitive Guide. O'Reilly Media / Yahoo! Press, June 5, 2009.
[8] Mahesh Maurya and Sunita Mahajan, "Performance Analysis of MapReduce Programs on Hadoop Cluster," 2012 World Congress on Information and Communication Technologies (WICT 2012), IEEE, pp. 505-510.

[9] Ralf Lämmel, "Google's MapReduce Programming Model - Revisited," published in Science of Computer Programming (SCP), Data Programmability Team, Microsoft Corp., Redmond, WA, USA, 2008.
[10] AOSA book. http://www.aosabook.org/en/hdfs.html
[11] Architecture of Hadoop. http://bigdataprocessing.wordpress.com/2012/07/27/architecture-of-hadoop/
[12] Hadoop overview. http://www.masukomi.org/writings/hadoop_overview.html
[13] Tiddlyspot. http://karalamadefteri.tiddlyspot.com/