INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY



A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIG DATA USING HDFS

MS. S. A. DHAWALE 1, PROF. G. D. GULHANE 2, DR. H. R. DESHMUKH 3, PROF. O. A. JAISINGHANI 4, PROF. S. V. KHEDKAR 4

1. PG Scholar, Dept. of Information Technology, IBSS COE, Amravati. 2. Dept. of CSE, IBSS COE, Amravati. 3. HOD, CSE & IT, IBSS COE, Amravati. 4. Dept. of IT, IBSS COE, Amravati.

Accepted Date: 05/03/2015; Published Date: 01/05/2015

Abstract: Apache Hadoop is an open-source software framework, written in Java, for the distributed storage and distributed processing of very large data sets (Big Data). Hadoop supplements existing data warehouses and conventional data sources, providing a cost-effective and efficient way to store, process, and analyze both structured and unstructured data. It runs on large clusters of industry-standard servers and can store petabytes of data as blocks spread across tens of thousands of machines. The Hadoop Distributed File System (HDFS), a base component of the framework, manages data storage on the cluster: it stores data as blocks on local disks, with a default block size of 64 MB that is commonly raised to 128 MB for large files.

Keywords: Big data, Hadoop, framework, cluster

Corresponding Author: MS. S. A. DHAWALE

Access Online On: www.ijpret.com

How to Cite This Article: S. A. Dhawale, IJPRET, 2015; Volume 3 (9): 1513-1519

INTRODUCTION

Hadoop is an open-source framework developed for the efficient and effective storage, distributed processing, and analysis of large data sets using clusters. It was designed to process large data sets on a cluster through a high-performance file system termed the Hadoop Distributed File System (HDFS), and it also contains a framework for job scheduling. Hadoop does not replace enterprise data warehouses, data marts, and other conventional data stores; it supplements those enterprise data architectures by providing an efficient and cost-effective way to store, process, and analyze the daily flood of structured and unstructured data.

Apache Hadoop is an open-source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. It is a middleware platform that manages a cluster of commodity computers and implements simple programming models; although Java is the main programming language for Hadoop, other languages such as R, Python, or Ruby can also be used.

Literature survey:

1) The Bottleneck in Big Data Analysis: Big Data refers to the large amounts, at least terabytes, of poly-structured data that flow continuously through and around organizations. When Yahoo, Google, Facebook, and other companies extended their services to web scale, the amount of data they collected routinely from user interactions online would have overwhelmed the capabilities of traditional IT architectures, so they built their own.
2) The Hadoop framework includes: the Hadoop Distributed File System (HDFS), a high-performance distributed file system in the Hadoop architecture; Hadoop YARN, a framework for job scheduling and cluster resource management; and

Hadoop MapReduce, a system for parallel processing of large data sets.

3) Architecture: Hadoop has a three-tier architecture. The first layer, the hardware layer, consists of the data nodes and the computers; the general structure of the analytics tools integrated with Hadoop can be viewed as the layered architecture presented in the figure. The second layer is the middleware layer, Hadoop itself, which manages the distribution of files using HDFS and the Map-Reduce jobs. The third layer provides an interface for data analysis. At this level we can have a tool like Pig, a high-level platform for creating Map-Reduce programs using a language called Pig Latin. We can also have Hive, a data warehouse infrastructure developed by Apache and built on top of Hadoop. Hive provides facilities for running queries and data analysis using an SQL-like language called HiveQL, and it also provides support for implementing Map-Reduce tasks.

4) Map-Reduce: MapReduce is the heart of Hadoop; it is used to process data in parallel on the cluster. Hadoop MapReduce is a software framework for the distributed processing of large data sets on compute clusters of commodity hardware. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
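Production MapReduce jobs are normally written in Java against Hadoop's MapReduce API, but the map/shuffle/reduce flow just described can be illustrated with a minimal, self-contained word-count sketch in plain Python (one of the languages the paper notes can be used with Hadoop). The function names here are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(splits):
    """Map: emit a (word, 1) pair for every word in every input split."""
    for split in splits:
        for word in split.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle/sort: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a smaller set of tuples."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two input splits, processed independently by the map phase.
splits = ["big data needs big storage", "hadoop stores big data"]
counts = reduce_phase(shuffle_phase(map_phase(splits)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real cluster the map calls run on different nodes, the shuffle moves data over the network, and the reducers run in parallel per key group; the sketch only shows the data flow.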
Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.

5) HDFS:

The Apache Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers.

6) Working process of Hadoop Architecture: Fig 1. HDFS Architecture

Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, the software busts that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.

In a centralized database system, you've got one big disk connected to four or eight or sixteen big processors, and that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set. Architecturally, the reason you're able to deal with lots of data is that Hadoop spreads it out, and the reason you're able to ask complicated computational questions is that you've got all of these processors working in parallel, harnessed together.
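The block-splitting rule described above can be sketched as a few lines of Python. This is a simplified model of how a file maps onto fixed-size blocks, not actual HDFS code; the function name and defaults are illustrative.

```python
def split_into_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Return (offset, length) for each HDFS-style block of a file.
    Every block is full-sized except possibly the last one."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size_bytes, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file under the 64 MB default: three full blocks plus an 8 MB tail.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                      # 4
print(blocks[-1][1] // (1024 * 1024))   # 8
```

Each of these blocks would then be replicated on three or more servers, which is what makes both parallel processing and fault tolerance possible.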
Hadoop implements a computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop Common is a set of utilities that supports the other Hadoop subprojects.

Applications: Fig. 2: Hadoop Architecture.

The main features of the Hadoop framework can be summarized as follows: high degree of scalability, since new nodes can be added to a Hadoop cluster as needed without changing data formats or the applications that run on top of the file system; cost-effectiveness, since it allows massively parallel computing using commodity hardware; flexibility, since Hadoop, unlike an RDBMS, is able to use any type of data, structured or not; and robustness, since if a node fails for any reason, the system sends the job to another location of the data and continues processing.
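The robustness property can be made concrete with a toy simulation. This is purely illustrative: real HDFS placement is rack-aware and directed by the NameNode, whereas here blocks are placed round-robin and a failed node's replicas are simply recreated on surviving nodes.

```python
from itertools import cycle

REPLICATION = 3  # HDFS keeps three copies of each block by default

def place_blocks(block_ids, nodes):
    """Round-robin placement of each block's replicas on distinct nodes."""
    placement = {b: [] for b in block_ids}
    node_iter = cycle(nodes)
    for b in block_ids:
        while len(placement[b]) < REPLICATION:
            n = next(node_iter)
            if n not in placement[b]:
                placement[b].append(n)
    return placement

def handle_node_failure(placement, failed, live_nodes):
    """Drop the failed node and re-replicate each under-replicated block
    onto a surviving node, copying from a known good replica."""
    for b, replicas in placement.items():
        if failed in replicas:
            replicas.remove(failed)
            target = next(n for n in live_nodes if n not in replicas)
            replicas.append(target)
    return placement

nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_blocks(["blk_1", "blk_2", "blk_3"], nodes)
live = [n for n in nodes if n != "n2"]
placement = handle_node_failure(placement, "n2", live)
# Every block is back at full replication, with no copy on the dead node.
assert all(len(r) == REPLICATION for r in placement.values())
assert all("n2" not in r for r in placement.values())
```

The point of the sketch is that no data is lost when a node dies: as long as one replica survives, the system can restore full replication and keep processing.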

CONCLUSION:

Using Hadoop in this way, an organization gains the additional ability to store and access data it might need, including data that may never be loaded into the data warehouse. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. Because Hadoop runs on cheap commodity hardware, automatically handles data replication and node failure, and does the hard work, you can focus on processing data; the result is cost saving and efficient, reliable data processing.[1] Hadoop is a cost-effective and massively scalable platform for ingesting big data and preparing it for analysis, and using it to offload traditional ETL processes can reduce time to analysis by hours or even days. Running a Hadoop cluster efficiently means selecting an optimal infrastructure of servers, storage, networking, and software.

REFERENCE:

1. Impetus Technologies white paper, "Planning Hadoop/NoSQL Projects for 2011", March 2011. Available:
2. http://www.techrepublic.com/whitepapers/planninghadoopnosql-projects-for-2011/2923717, March 2011.
3. McKinsey Global Institute, 2011, "Big Data: The Next Frontier for Innovation, Competition, and Productivity". Available: www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/technology%20and%20innovation/big%20data/mgi_big_data_full_report.ashx, Aug. 2012.
4. Thomas Herzog, Associate Commissioner, New York State, and Thomas Kooy, IJIS Institute, "Big Data and the Cloud", IJIS Institute Emerging Technologies. Available: http://www.correctionstech.org/meeting/2012/Presentations/Red_01.pdf, Aug. 2012.
5. Jacobs, A., "The Pathologies of Big Data", ACM Queue. Available: http://queue.acm.org/detail.cfm?id=1563874, 6 July 2009.
6. Apache Software Foundation, official Apache Hadoop website, http://hadoop.apache.org/, Aug. 2012.

7. "The Hadoop Architecture and Design". Available: http://hadoop.apache.org/common/docs/r0.16.4/hdfs_design.html, Aug. 2012.
8. Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker (Yahoo and UCLA), "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters", in Proc. of ACM SIGMOD, pp. 1029-1040, 2007.
9. White, Tom, Hadoop: The Definitive Guide, 2nd Edition, United States: O'Reilly Media, Inc., 2010.
10. P. Zadrozny and R. Kodali, Big Data Analytics Using Splunk, Berkeley, CA, USA: Apress, 2013.
11. F. Ohlhorst, Big Data Analytics: Turning Big Data into Big Money, Hoboken, NJ, USA: Wiley, 2013.
12. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters", Commun. ACM, 51(1), pp. 107-113, 2008.
13. Apache Hadoop, http://hadoop.apache.org.
14. F. Li, B. C. Ooi, M. T. Özsu and S. Wu, "Distributed data management using MapReduce", ACM Computing Surveys, 46(3), pp. 1-42, 2014.
15. C. Doulkeridis and K. Nørvåg, "A survey of large-scale analytical query processing in MapReduce", The VLDB Journal, pp. 1-26, 2013.
16. S. Sakr, A. Liu and A. Fayoumi, "The family of MapReduce and large-scale data processing systems", ACM Computing Surveys, 46(1), pp. 1-44, 2013.
17. Vidyasagar S. D., "A Study on Role of Hadoop in Information Technology Era", Volume 2, Issue 2, Feb. 2013, ISSN No. 2277-8160.