Chapter 7. Using Hadoop Cluster and MapReduce




Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152

7. Using Hadoop Cluster and MapReduce for Big Data Problems

The size of the databases used in today's enterprises has been growing at an exponential rate. In this electronic age, an increasing number of organizations face an explosion of data. Data is generated by many sources such as business processes, transactions, social networking sites and web servers, and exists in structured as well as unstructured form [176]. Today's business and scientific applications are large scale, data intensive and web oriented, and are accessed from diverse devices, including mobile devices. At the same time, the need to process and analyze large volumes of data for business decision making has also increased. Processing and analyzing this huge amount of data, and extracting meaningful information from it, is a challenging task. Several business and scientific applications need to process terabytes of data efficiently on a daily basis. This has led to the big data problem faced by industry: conventional database systems and software tools are unable to manage or process such big data sets within tolerable time limits.

Big Data and Related Problems

The term Big Data refers to data sets so large that they are difficult to store, process and analyze with currently available tools and database management systems within reasonable time limits. Big data sizes in today's organizations are constantly increasing, from terabytes to petabytes of data [177].
Big data also includes unstructured and semi-structured data such as web logs, RFID-generated data, sensor network data, video surveillance data, audio, video and graphical data, satellite and geo-spatial data, social data from social networks, Internet data and web pages, Internet search indexes, and domain-specific big data such as biological life science data, banking and financial portfolios, telephone call records, national security surveillance, and medical images and records. Big Data scenarios are faced by many organizations: Walmart handles billions of transactions per day, resulting in a database size of multiple petabytes, while sites like Facebook and Twitter have to handle billions of unstructured items such as photos, videos, text feeds and messages. In addition to variations in the amount of data stored in different sectors, the types of data generated and stored, i.e., whether the data encodes video, images, audio or text/numeric information, also differ markedly from industry to industry [178].

The big data problem emerges because currently available database management systems and software tools are unable to store and process large amounts of structured and unstructured data. Depending on usage, processing can involve various operations such as culling, tagging, highlighting, indexing, searching and faceting. It is not possible for a single machine, or a few machines, to store or process this huge amount of data in a finite time period. To address these problems, I have implemented a prototype Hadoop cluster in the grid test bed, using HDFS for storage and the MapReduce framework for processing large data sets, considering both compute and big data application scenarios. The results obtained from various experiments indicate that this approach addresses the big data problem favorably.

7.1 Hadoop and MapReduce Framework

Hadoop is a software framework developed by the Apache Software Foundation for creating clusters and for the parallel processing of data-intensive applications in a reliable, scalable and fault-tolerant manner. The framework includes the data cluster, distributed data storage and a parallel processing model called MapReduce. Hadoop is highly scalable and can work with thousands of processing nodes and petabytes of data.

HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) runs on commodity hardware and provides a high-performance distributed file system that is reliable and fault tolerant. HDFS is designed for big data applications that need to process large amounts of heterogeneous data with high-throughput concurrent and parallel access. HDFS stores data sets in parallel across a large number of servers/commodity machines, which makes it possible to run parallel processing (MapReduce jobs) on the data available locally at each node. HDFS has a master/slave architecture. At loading time, the large data is split into parts which are stored and processed by different nodes in the Hadoop cluster [179].

MapReduce Framework

MapReduce is a parallel programming framework, based on the model published by Google in 2004, to support distributed computing on large data sets on compute clusters. It provides a programming model for the parallel and distributed processing of large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key [180].

The problem must be expressed in terms of the MapReduce framework steps, which operate on list-based processing of data. The MapReduce steps work as follows [181] [182]:

Figure 7.1: Distributed Map and Reduce processes [182]

"Map" step: The master node filters and transforms the input and assigns partitions of the input data to worker nodes. The mapping step processes the input data and generates intermediate results. Mapping is performed in parallel by multiple slave nodes.

Map (k1, v1) -> list (k2, v2)

"Reduce" step: The intermediate results from multiple mapping processes are combined by reduce process. Reduce function is also applied in parallel manner to each group of intermediate results with same key values. Reduce (K2, list (v2)) list (v3) mapper (filename, file-contents): for each word in file-contents: emit (word, 1) reducer (word, values): sum = 0 for each value in values: sum = sum + value emit (word, sum) Figure 7.2: Pseudo code of Map Reduce for counting words in different files program [183] Modeling and Prototyping of RMS for QoS Oriented Grid Page 157

7.2 Experimental Setup and System Architecture

The system architecture comprises a multi-node cluster with one master node, multiple slave nodes, an HDFS name node and multiple HDFS data nodes for storing data. The HDFS architecture consists of a single name node, running on the master node, which manages the metadata and access to the file blocks stored by the multiple data nodes running on the slave nodes. The data is divided into file blocks and stored in a distributed manner by the different data nodes. The file blocks are automatically and redundantly replicated across different data nodes (the default replication factor is 3) to ensure high availability and fault tolerance.

Figure 7.3: Hadoop cluster setup using master and slave nodes. The master node runs the Job Tracker (job scheduling) and the Name Node (HDFS metadata); each slave node runs a Task Tracker (executes job work units) and an HDFS Data Node, with the MapReduce shuffle taking place between the slave nodes.

For the big data experiments, a Hadoop data cluster comprising four nodes, with the Hadoop Distributed File System (HDFS) for storage, was used. Before moving to the multi-node cluster, a single-node cluster was first configured and tested.
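The block splitting and replication described above can be illustrated with a small simulation. This is a hedged sketch, not Hadoop code: the 64 MB block size, the node names and the round-robin placement policy are simplifications chosen for illustration (HDFS's real placement policy is rack-aware and chosen by the name node).

```python
def place_blocks(file_size_mb, block_size_mb=64,
                 data_nodes=("slave1", "slave2", "slave3", "slave4"),
                 replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct data nodes (simplified round-robin placement)."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Pick `replication` distinct nodes for this block's replicas.
        placement[f"block-{b}"] = [data_nodes[(b + r) % len(data_nodes)]
                                   for r in range(replication)]
    return placement

# A 200 MB file with 64 MB blocks needs 4 blocks,
# each stored redundantly on 3 of the 4 data nodes.
for block, nodes in place_blocks(200).items():
    print(block, nodes)
```

Because every block lives on several nodes, the loss of one data node leaves every block still readable, and MapReduce tasks can be scheduled on whichever replica-holding node is least loaded.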

Hadoop has many configuration parameters, but the most relevant for the purpose of this evaluation is the number of concurrent Map and Reduce tasks that are allowed to run on each node. We configured our cluster to run eight concurrent tasks per worker node. Each MapReduce program that is run is partitioned into M map tasks and R reduce tasks. Input and output data for the MapReduce programs are stored in HDFS, while input and output data for the data-parallel stack-based implementation are stored directly on the local disks. One node was configured as the master node and the other nodes were designated as slave nodes. The master node runs the master daemons: the NameNode for the HDFS storage layer and the JobTracker for the MapReduce processing layer. The slave nodes run the slave daemons: a DataNode for the HDFS layer and a TaskTracker for the MapReduce processing layer. The master node was also used as a slave node to increase the number of processing nodes. The software used for the master and slave nodes was Sun Java 1.6, Ubuntu Linux 10.04 LTS and Hadoop 1.0.3.

7.3 Experiment Results

7.3.1. Text Processing Application

The first experiment was a simple text processing word count experiment, counting the number of words that occur within a set of large documents. Different documents were loaded on different nodes. The map function runs on each node, processes the document stored locally, and generates intermediate results in the form (word, 1). The shuffle step organizes and sorts all the intermediate results generated by the map processes by key, and all results with the same key are passed on to the same reducer process. The reduce process accepts the intermediate results with the same key (word) and applies the aggregation logic, i.e., it adds up the occurrences of each word across all the documents.

1) Experiment with increase in number of nodes

Dataset: size of files used = 100 MB

Figure 7.4: Execution time with varying number of nodes

2) Experiment with increase in size of dataset and nodes

Figure 7.5: Execution time with varying dataset and nodes

The above experiment results indicate that there is a significant decrease in execution time as the number of nodes used in the Hadoop cluster increases.

7.3.2. Earthquake Data Analysis

In the second experiment, we analyzed the earthquake data published by the U.S. Geological Survey (USGS). The earthquake data is available in the form of CSV (comma-separated values) files published at periodic intervals. The analysis of the earthquake data answers questions such as where earthquakes occurred, by location, and how many earthquakes occurred on a particular date.

1) Experiment with increase in number of nodes

Dataset: 7-day earthquake report of USGS

Figure 7.6: Quake analysis, number of nodes vs. execution time
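The per-location analysis in this experiment follows the same map/shuffle/reduce pattern as the word count: each CSV record maps to a (location, 1) pair and the reducer sums the counts. A minimal local sketch is shown below; the column names in the sample data are assumptions for illustration, since the exact header of the USGS CSV feed has varied over time.

```python
import csv
import io
from collections import Counter

def count_quakes_by_region(csv_text, region_column="place"):
    """Map each CSV record to (region, 1) and reduce by summing,
    mirroring the word count pattern used in the first experiment."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row[region_column]] += 1
    return counts

# Tiny stand-in for a USGS 7-day CSV report (column layout is assumed).
sample = """time,latitude,longitude,magnitude,place
2012-05-01T00:01,34.1,-117.2,2.1,Southern California
2012-05-01T03:40,61.3,-150.1,3.4,Southern Alaska
2012-05-01T09:12,34.0,-117.5,1.8,Southern California
"""
print(count_quakes_by_region(sample))
```

Counting earthquakes per date works the same way, keying the mapper output on the date portion of the time column instead of the place column.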

2) Experiment with increase in size of dataset and nodes

Figure 7.7: Earthquake analysis, number of days of report vs. execution time

The above experiment results indicate that the Hadoop and MapReduce setup is highly scalable for processing large data sets, and that job execution time decreases as more nodes are used in the Hadoop cluster. In this experimental work, a Hadoop cluster, HDFS and the MapReduce programming framework were used to address data-intensive application scenarios. The results obtained from the various experiments indicate that this approach addresses the big data problem favorably: the Hadoop cluster scales to support larger data sets and takes less job execution time as the number of nodes increases.