INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY



A PATH FOR HORIZING YOUR INNOVATIVE WORK

A COMPREHENSIVE VIEW OF HADOOP

ER. AMRINDER KAUR
Assistant Professor, Department of Computer Science & Engineering, University Institute of Engineering & Technology, Kurukshetra University, Haryana.

Accepted Date: 24/09/2015; Published Date: 01/10/2015

Abstract: Data is evolving day by day in tremendous amounts (on the order of exabytes, i.e. 10^18 bytes) and at high speed, in many varieties: financial data, weather forecasts, social media, email, and so on. This vast amount of data cannot be warehoused at a single site, so a framework is needed that distributes the data across multiple clusters and provides distributed computing to answer queries. Hadoop is a solution to these needs: it is open source and fault tolerant, and it provides highly available services to a cluster of computers.

Keywords: Hadoop, MapReduce, client-server, task tracker, job tracker

Corresponding Author: ER. AMRINDER KAUR
Access Online On: www.ijpret.com

1. INTRODUCTION

Hadoop is free software whose programming is based on a Java framework. It supports a distributed computing environment in which large data sets are processed. Hadoop is batch oriented: jobs are queued and then executed, and processing a job may take minutes or hours. The basic storage mechanism in Hadoop is the Hadoop Distributed File System (HDFS) [1].

The MapReduce framework was proposed by Google. The framework is responsible for everything else, such as parallelization and fail-over. The MapReduce framework reads and writes its data through Hadoop's distributed file system. Usually, Hadoop MapReduce uses HDFS, the open-source counterpart of the Google File System (GFS). Therefore, the input and output performance of a Hadoop MapReduce job depends strongly on HDFS.

2. ARCHITECTURE

Hadoop is an open-source framework composed of the Hadoop Distributed File System and the MapReduce engine. It is a scalable, fault-tolerant distributed system for data storage and processing, and it provides a framework for the analysis and transformation of extremely large data sets using the MapReduce paradigm [3][5].

2.1. Hadoop Distributed File System

HDFS is responsible for managing the data, or files, that reside on the different clusters. The metadata of a file and the application data needed during a job are stored separately: metadata is stored on the name node, a dedicated server, while application data is stored on the data nodes, the other servers. These servers are connected to each other and communicate using TCP-based protocols. In HDFS, data durability is provided not by RAID but by replication: for reliability, file content is duplicated on multiple data nodes [2][3][6]. The HDFS structure is shown in the figure below.
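The replication idea can be illustrated with a small sketch. This is plain Python, not Hadoop code; the node names and the `place_replicas` helper are hypothetical, and real HDFS placement is rack-aware rather than purely random:

```python
import random

def place_replicas(block_id, datanodes, replication=3):
    """Pick `replication` distinct data nodes to hold copies of a block,
    mimicking HDFS's default replication factor of 3. This toy version
    samples nodes at random; real HDFS placement is rack-aware."""
    if len(datanodes) < replication:
        raise ValueError("not enough data nodes for requested replication")
    return random.sample(datanodes, replication)

# A toy block map of the kind the name node maintains: block -> replica nodes.
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
block_map = {blk: place_replicas(blk, datanodes) for blk in ("blk_0", "blk_1")}
for blk, nodes in block_map.items():
    print(blk, "->", sorted(nodes))
```

Because each block lives on three distinct nodes, the loss of any single data node leaves at least two readable copies of every block.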

Figure: HDFS Architecture

2.1.1. Name Node

The name node is responsible for maintaining the directory structure of HDFS, also known as the namespace. On the name node, an inode represents each directory and file and records attributes such as permissions, access times, and modification times. The name node also maintains the mapping between the blocks of a file and the data nodes on which those blocks are stored; each block of a file is replicated on multiple data nodes. To perform any operation on a file (such as open, close, delete, or rename), a client first contacts the name node. For example, to perform a read on a file, a client first asks the name node for the locations of the data nodes and then reads the data from the data node closest to the client. When a client wants to write data to a file, it asks the name node to nominate three data nodes to hold the block replicas, and writing is performed in pipeline fashion. Currently, a single name node is nominated for each cluster, but there can be hundreds or thousands of data nodes executing multiple tasks concurrently [2][3][6].

2.1.2. Data Node

A data node holds block replicas, and each block replica is represented by two files in the local host's native file system: the first contains the data itself and the second contains the block's metadata. Every node except the name node acts as a data node. Each node holds file blocks on behalf of local or remote clients. Blocks are created or destroyed on data nodes at the request of the name node, which is responsible for validating and processing requests from clients. Clients communicate directly with data nodes in order to read or write data at

the HDFS block level. During the startup phase, a data node connects to the name node and performs a handshake, which verifies the namespace ID and the software version of the data node. If the ID does not match the name node's, the data node automatically shuts down. Each data node sends a heartbeat signal every few seconds; if it fails to send these signals, it is considered out of service and the name node finds other data nodes for its block replicas [2][3][6].

2.2. MapReduce

Hadoop MapReduce is one of the popular implementations of the MapReduce framework proposed by Google. It has become popular because it is easy to use, scalable, and fault tolerant, and it is used for processing big data in both industry and academia. It consists of two user-defined functions, map and reduce. The map function takes input in the form of pairs (k, v), where k refers to a key and v to a value. The map function is applied to each (k, v) pair and generates intermediate key-value pairs, written (k', v'). The reduce function is then called for each intermediate key, merging all the intermediate values associated with that single key [4][5]. The MapReduce architecture is shown in the figure below.

Figure: MapReduce Architecture
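The map/shuffle/reduce flow described above can be sketched in miniature. The following plain-Python sketch simulates in a single process what Hadoop distributes across a cluster; the word-count example is the classic illustration and is not taken from the paper:

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: for each input record, emit intermediate (k', v') pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: merge all intermediate values that share a single key.
    yield key, sum(values)

def mapreduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate pairs by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    out = {}
    for ik, ivs in sorted(groups.items()):
        for ok, ov in reduce_fn(ik, ivs):
            out[ok] = ov
    return out

lines = [(0, "big data big cluster"), (1, "big data")]
counts = mapreduce(lines, map_fn, reduce_fn)
print(counts)  # {'big': 3, 'cluster': 1, 'data': 2}
```

In real Hadoop the records would be HDFS blocks, the map calls would run on many task trackers in parallel, and the shuffle would move intermediate pairs across the network; the logical contract of the two user-defined functions is the same.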

2.2.1. Job Tracker

The job tracker is responsible for maintaining the list of processing resources available in the cluster. A job tracker runs on the master node and is responsible for distributing the MapReduce jobs across the cluster. When a job is requested, the job tracker schedules it and assigns it to task trackers running on the data nodes. A client node first submits its job to the job tracker, which determines the location of the data on the data nodes. After locating the data, the job tracker selects the corresponding task tracker node that is nearest to the data or that has available slots, and the job is assigned to that node. Task tracker nodes are monitored continuously; if they stop responding with heartbeat signals, they are considered failed and the job is scheduled on some other task tracker. If a job fails for some reason, the task tracker notifies the job tracker, which then decides whether to submit the job elsewhere or to restart it on the same task tracker node. On completion of a job, the job tracker updates the job's status, and the client node can then ask the job tracker for this information. The diagram below shows the cluster setup in the network [7].

Figure: Cluster setup in the network
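The locality-aware assignment rule just described (prefer a free slot on a node holding the data, otherwise a free node in the same rack) can be sketched as follows. All names here are hypothetical for illustration, not Hadoop's API:

```python
def pick_tracker(block_nodes, free_slots, rack_of):
    """Illustrative scheduling rule: prefer a task tracker with a free
    slot on a node that holds the data block; otherwise fall back to a
    free node in the same rack as a replica holder. Returns None if
    neither exists."""
    # 1. Node-local: a replica holder that also has a free slot.
    for node in block_nodes:
        if free_slots.get(node, 0) > 0:
            return node
    # 2. Rack-local: any free node sharing a rack with a replica holder.
    racks = {rack_of[n] for n in block_nodes}
    for node, slots in free_slots.items():
        if slots > 0 and rack_of.get(node) in racks:
            return node
    return None

rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2"}
# The replica lives on dn1, but dn1 has no free slot; dn2 is free and
# shares rack r1 with dn1, so the rack-local fallback picks it.
choice = pick_tracker(["dn1"], {"dn1": 0, "dn2": 1, "dn3": 2}, rack_of)
print(choice)  # dn2
```

Running the map task near its data avoids shipping the block across the network, which is the point of the job tracker's locality preference.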

2.2.2. Task Tracker

The duty of a task tracker is to execute the jobs assigned to it by the job tracker node and to report the status of those jobs back to the job tracker. In the cluster, a task tracker daemon runs on every slave node, so the processing of data, as well as its storage, is handled by the task trackers. The map, reduce, and shuffle operations assigned by the job tracker are performed by the task trackers. Each task tracker maintains a set of slots: some are allotted to map tasks and some to reduce tasks. When the job tracker wants to schedule a task, it first looks for an available empty slot on a server whose data node holds the needed data; if no such slot is available, it looks for an empty slot on another node in the same rack. While processing, the task tracker generates a heartbeat signal every few minutes to assure the job tracker that it is alive and performing its job; this heartbeat is also useful in determining the number of available slots. After completing a job, the task tracker reports back to the job tracker with the job's status [7].

3. APPLICATIONS OF HADOOP

- Hadoop is used to analyze risks that are life threatening for mankind.
- It is used to identify security breaches by analyzing warning signs.
- Hadoop is used to understand how people perceive a company or organization by analyzing their social media conversations.
- By analyzing sales data against various factors such as weather, days of the week, and weekends, it helps in understanding when to sell which products.
- Log files generated by software contain very useful data; by analyzing these log files one can find security breaches and usage statistics.
- It is used in various fields such as politics, data storage, financial services, health care, telecoms, human science, and travel [8][9].
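The heartbeat-based liveness check used by both the name node (for data nodes, section 2.1.2) and the job tracker (for task trackers, section 2.2.2) can be sketched as a simple timeout test. The timeout value and function names below are illustrative, not Hadoop's configuration:

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value, not Hadoop's default

def live_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Split nodes into live and failed according to their most recent
    heartbeat timestamp, as the master daemons do for their workers.
    A toy model: real Hadoop uses longer, configurable timeouts."""
    live, failed = [], []
    for node, t in sorted(last_heartbeat.items()):
        (live if now - t <= timeout else failed).append(node)
    return live, failed

beats = {"dn1": 100.0, "dn2": 92.0, "dn3": 85.0}
live, failed = live_nodes(beats, now=101.0)
print(live, failed)  # ['dn1', 'dn2'] ['dn3']
```

Once a node lands in the failed list, the master re-replicates its blocks or reschedules its tasks elsewhere, which is how Hadoop turns missed heartbeats into fault tolerance.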

4. CONCLUSION

Big data is data accumulating at tremendous speed from different sources and in different varieties, such as social media, sensor data, and email. Today's data volumes range into petabytes, but in future they will range from a few exabytes to thousands of exabytes. To handle such a volume of data, an efficient tool is needed that can analyze and mine useful knowledge from it. Hadoop is an answer to the needs raised by big data. It is applicable in many fields of life, such as health, science, telecoms, and data storage, and can therefore help answer the different questions raised in those fields.

5. REFERENCES

1. Revolution Analytics White Paper, "Advanced Big Data Analytics with R and Hadoop", 2011.
2. Konstantin Shvachko et al., "The Hadoop Distributed File System", IEEE, 2010.
3. Hadoop, http://hadoop.apache.org, September 2015.
4. Jens Dittrich et al., "Efficient Big Data Processing in Hadoop MapReduce", The 38th International Conference on Very Large Data Bases, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 12, August 27th-31st, 2012.
5. Harshawardhan S. Bhosale et al., "A Review Paper on Big Data and Hadoop", International Journal of Scientific and Research Publications, Volume 4, Issue 10, October 2014.
6. HDFS (Hadoop Distributed File System) Architecture, September 2015, http://hadoop.apache.org/common/docs/current/hdfs_design.html.
7. Hadoop MapReduce framework, September 2015, https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
8. http://www.mrc-productivity.com/blog/2015/06/7-real-life-use-cases-of-hadoop/
9. http://hadoopilluminated.com/hadoop_book/hadoop_use_cases.html