A Performance Analysis of Hadoop Clusters in OpenStack Cloud and in Real System

Ramesh Maharjan and Manoj Shakya
Department of Computer Science and Engineering, Kathmandu University, Dhulikhel, Kavre, Nepal
lazymesh@gmail.com, manoj@ku.edu.np

Abstract

Cloud computing, big data, and distributed systems are the three central themes of this paper. Cloud computing is being embraced by organizations of every kind and applied in every field of work, be it business or education, while storing and processing data is a fundamental task of any organization. Hadoop is a distributed framework created to handle big data processing. The aim of this paper is to study and compare Hadoop clusters running in the cloud and on commodity hardware with respect to performance, flexibility, scalability, and related properties.

Introduction

Cloud computing is an abstract term describing the use of resources that do not belong to the user: the user connects to them to perform the required task and disconnects when they are no longer needed. Buyya et al. [1] define it as follows: "A cloud is a parallel and distributed computing system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements (SLA) established through negotiation between the service provider and consumers." Vaquero et al. [2] state that "clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized Service Level Agreements."

Cloud computing [3] has early roots in the time when mainframe computers in academia and corporations were shared among users through client terminals. It draws on several computing research areas [5], including high-performance computing (HPC), virtualization, utility computing, grid computing, networking, and security.

By the service provided, cloud computing is broadly divided into Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) [4]. SaaS provides both resources and applications as a service to clients. PaaS provides clients with the platforms they need to develop and deploy their own applications. IaaS provides storage and computing infrastructure to clients over the internet. By deployment, clouds are subcategorized into public, private, community, and hybrid clouds [4]. A public cloud is accessible to anyone with an internet connection. A private cloud is limited to one organization or group. A community cloud is a private cloud shared between several groups or organizations. A hybrid cloud is a combination of at least two of the clouds just described.

There are many platforms available for setting up a cloud. OpenStack was chosen for the following reasons. First of all, it is open source, which leaves us free to pick and mix hardware, design our own networks, use any virtualization technology, and add whatever other features are needed. Secondly, it has one of the largest groups of developers and contributors. Thirdly, it is simple to configure and use.

Data is an inseparable asset of any organization. According to IBM [6], "Every day, we create 2.5 quintillion bytes of data." The obvious cause is the advancement of technology and its ever-wider use. This piled-up data needs to be stored, processed, and analyzed to yield useful information. There are many technical solutions, but Hadoop was chosen for the following reasons: it makes data mining, analytics, and processing of big data cheap and fast; it is an open-source project built to deal with terabytes of data in minutes; it is how companies with gigantic amounts of data, such as Facebook, Twitter, Yahoo, eBay, and Amazon, can make decisions quickly and cost-effectively; it is easily scalable, since hard drives and nodes can be added without shutting Hadoop down; it stores and processes any kind of data; and, although it is written natively in Java, it can be used from other languages, such as the SQL-inspired Hive language, C/C++, and Python.

Some knowledge of Hadoop is needed to follow this paper. The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage, and it provides building blocks for distributed data storage, data analysis, and coordination. Written in Java and meant to run applications on large clusters of commodity hardware, it incorporates features similar to those of the Google File System (GFS) and of MapReduce. Hadoop employs a master/slave architecture for both distributed storage and distributed computation.
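The programming model matters for what follows, since the third benchmark job used later is the word-count example that ships with the Hadoop distribution. As a minimal sketch of a MapReduce job against the Hadoop 1.x Java API, the well-known WordCount program looks roughly as follows; class names and details here are illustrative, not taken from the experiments:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map: (byte offset, line of text) -> (word, 1)
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // reduce: (word, [1, 1, ...]) -> (word, total count)
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

On a running cluster, the bundled version of this example is launched with "hadoop jar hadoop-examples-*.jar wordcount <input dir> <output dir>".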

This paper is divided into five parts: introduction, cloud, clusters, analysis, and conclusion. The cloud section briefly describes the OpenStack cloud. The clusters section explains the clustering of Hadoop on real computers and in the cloud, and the running of applications on these distributed systems. The analysis section compares the two systems with appropriate graphs.

The cloud

According to the official OpenStack starter guide [7], "Cloud computing is a computing model, where resources such as computing power, storage, network and software are abstracted and provided as services on the Internet in a remotely accessible fashion." An infrastructure set up using the cloud computing model is generally referred to as a "cloud". The guide further explains that OpenStack is a collection of open-source software projects that enterprises and service providers can use to set up and run their cloud compute and storage infrastructure. The installed cloud consists of five OpenStack projects: nova-compute (compute), Swift (object storage), Glance (image service), Keystone (identity), and Horizon (the dashboard); their arrangement is illustrated in the following figures.

Fig 1: simple OpenStack architecture
Fig 2: simple OpenStack workflow

There are many reasons behind choosing OpenStack. The first is its flexibility regarding networking and hardware. The second is that it is open source, with one of the largest communities of developers and contributors. Thirdly, OpenStack has become a phenomenon: big companies such as Dell, AMD, Cisco, and HP, alongside Rackspace, are using it, and Linux heavyweights such as Red Hat and Ubuntu are adopting it. Last but not least, it is simple and flexible to configure and use.
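To make the provisioning workflow of Fig 2 concrete: on an Essex-era installation such as this one, a virtual server is typically booted with the nova command-line client along the following lines. The image, flavor, and key-pair names here are hypothetical placeholders for whatever has been registered in Glance and Keystone, and flag spellings vary slightly between client versions (for example, --key-name in newer clients); this is a sketch, not the authors' exact procedure.

    # boot one virtual server (node) from an image registered in Glance
    nova boot --image ubuntu-12.04-server --flavor m1.small --key_name hadoop-key hadoop-master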

The clusters

After the successful installation and configuration of the OpenStack cloud, virtual servers were created in it. For this experiment, four instances were created under a Hadoop project: one acting as both master and slave, and the remaining three as slaves. This four-node cluster was used to run and analyze the Hadoop jobs. The same configuration was applied to physical personal computers as well, so that the cluster in the cloud was kept as close to identical to the cluster in the real systems as possible. The details of the clusters are given in the table below.

Table 1: Details of the clusters

Server (cloud vs real)    Personal computer                 Virtual server in cloud
Master vs kucse-dcg       2 GB RAM, 2 VCPU, 160 GB disk     2 GB RAM, 2 VCPU, 80 GB disk
Slave1 vs user            2 GB RAM, 2 VCPU, 80 GB disk      2 GB RAM, 2 VCPU, 80 GB disk
Slave2 vs user1           2 GB RAM, 2 VCPU, 80 GB disk      2 GB RAM, 2 VCPU, 80 GB disk
Slave3 vs user3           2 GB RAM, 2 VCPU, 80 GB disk      2 GB RAM, 2 VCPU, 80 GB disk

A screenshot of the cloud dashboard is given below.

Fig 3: dashboard of the Kathmandu University cloud with the Hadoop project
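Before moving on to the jobs, it is worth sketching how little cluster-level configuration such a four-node Hadoop 1.x deployment needs. In practice these properties are set in conf/core-site.xml, conf/mapred-site.xml, and conf/hdfs-site.xml on every node, together with the masters and slaves host-list files; the fragment below expresses the same settings through Hadoop's Java Configuration API purely for illustration, with the host name "master" standing in for whichever machine runs the NameNode and JobTracker:

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch of the cluster-level settings (Hadoop 1.x property names).
    public class ClusterConf {
        public static Configuration create() {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://master:9000"); // HDFS NameNode, from core-site.xml
            conf.set("mapred.job.tracker", "master:9001");     // JobTracker, from mapred-site.xml
            conf.setInt("dfs.replication", 3);                 // HDFS block replicas, from hdfs-site.xml
            return conf;
        }
    }

With four data nodes, a replication factor of 3 keeps every HDFS block available even when a node fails, which is what makes the fault tolerance observed later possible.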

After the successful configuration of the clusters, three jobs were run on both systems: two jobs converting image files to PDF files, and one word-count job. The first two jobs were based on serializing image and PDF files in the MapReduce framework [9][10][11] (a sketch of this conversion is given after the tables); the last job is included in the Hadoop package itself. The results, which mostly matched expectations, are summarized in the tables below.

Table 2: Summary of the first job

              Cluster in personal computers         Cluster in virtual servers in cloud
Inputs        23 folders, 94 image files, 169 MB    23 folders, 94 image files, 169 MB
Outputs       23 folders, 94 PDF files, 90.1 MB     1 folder, 94 PDF files, 90.1 MB
Time taken    3 minutes 8 seconds                   1 minute 31 seconds

Table 3: Summary of the second job

              Cluster in personal computers         Cluster in virtual servers in cloud
Inputs        1 folder, 476 image files, 926 MB     1 folder, 476 image files, 926 MB
Outputs       1 PDF file, 200.1 MB                  1 PDF file, 200.1 MB
Time taken    7 minutes 51 seconds                  9 minutes 22 seconds

Table 4: Summary of the third job

              Cluster in personal computers         Cluster in virtual servers in cloud
Inputs        1 text file, 1.1 GB                   1 text file, 1.1 GB
Outputs       1 text file with counts, 364.6 KB     1 text file with counts, 364.6 KB
Time taken    4 minutes 0 seconds                   5 minutes 1 second

Three graphs generated from the above tables are given below.

Graph 1: time taken for the first job
Graph 2: time taken for the second job
Graph 3: time taken for the third job
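The image-to-PDF jobs combine MapReduce with the iText library [11]; since their source is not reproduced in the paper, the following is only a minimal reconstruction of what the core mapper might look like, assuming the images arrive as a SequenceFile of (file name, raw image bytes) records. All class and variable names are hypothetical.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.DocumentException;
    import com.itextpdf.text.Image;
    import com.itextpdf.text.pdf.PdfWriter;

    // Hypothetical mapper: turns one (file name, image bytes) record into (file name, PDF bytes).
    public class ImageToPdfMapper
            extends Mapper<Text, BytesWritable, Text, BytesWritable> {

        @Override
        protected void map(Text fileName, BytesWritable imageBytes, Context context)
                throws IOException, InterruptedException {
            // BytesWritable's backing array may be longer than the payload, so copy exactly.
            byte[] raw = Arrays.copyOf(imageBytes.getBytes(), imageBytes.getLength());
            ByteArrayOutputStream pdfBuffer = new ByteArrayOutputStream();
            try {
                Document document = new Document();      // page sizing ignored for brevity
                PdfWriter.getInstance(document, pdfBuffer);
                document.open();
                document.add(Image.getInstance(raw));    // one single-image page
                document.close();
            } catch (DocumentException e) {
                throw new IOException(e);
            }
            context.write(fileName, new BytesWritable(pdfBuffer.toByteArray()));
        }
    }

A map-only job with a SequenceFile output format would then write the PDF bytes back to HDFS; merging many images into a single PDF, as in the second job, would instead accumulate the pages in a reducer.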

Analysis

The Hadoop distributed system set up on personal computers was expected to be more efficient and faster than the cloud system. The obvious reasons are that the Hadoop framework was developed with commodity machines in mind and that, on real hardware, processing is done without any resource sharing, in contrast to cloud systems.

The first job contradicted this expectation and the results of the other two jobs. The reason is that this job has to read and write files recursively and therefore caches all the bytes it reads; because of hardware deficiencies in the personal computers used (especially in memory), tasks had to be cancelled on the weaker machines (task trackers). The job was therefore somewhat slower on the real computers, owing to these cancellations and the transfer of the work to other task trackers (nodes).

To sum up, the personal-computer Hadoop cluster was as powerful and fault tolerant as the cloud Hadoop cluster, but not as scalable and flexible. Real systems consist of real hardware that cannot easily be made available on demand, whereas the cloud cluster consists of virtualized systems that can be varied according to needs and availability. Creating a new virtual server (node) in the cloud takes seconds, while adding a new node to the real system depends on the availability of physical computers. A problem in a virtual server can be solved by terminating it and creating a new one; the same cannot be done in the real system. The real system's nodes were connected with network cables, which introduced a small but insignificant delay in communication between them, while the cloud nodes lived inside the same project, which made communication between them much faster. The most important aspect of the cloud is that server creation and termination are extraordinarily easy, so the cluster can be scaled up and down with great ease. Another aspect is that the cluster can be made public without any difficulty: as the screenshot above shows, the master server's web interface is accessed through a browser on the client.

Conclusion

This paper has analyzed running a Hadoop cluster in the cloud and in a real system, identifying the better solution by running simple Hadoop jobs in the configured clusters. It concludes that a Hadoop cluster in the cloud is more flexible and more easily scalable for data storage and analysis than a real-system cluster, while the real-system cluster is faster than the cloud cluster. Nevertheless, thanks to the advantageous features of cloud computing, such as quick termination of servers (nodes) when problems arise, re-creation of a node from the state in which the machine was terminated, automatic networking, and instant creation of nodes and clusters, the cloud Hadoop cluster is the more favorable choice.

References

[1] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, 25:599-616, 2009.
[2] L. M. Vaquero, L. Rodero-Merino, J. Caceres, and M. Lindner, "A break in the clouds: Towards a cloud definition," SIGCOMM Computer Communication Review, 39:50-55, 2009.
[3] "Cloud computing," Wikipedia, en.wikipedia.org/wiki/cloudcomputing
[4] http://www.us-cert.gov/sites/default/files/publications/cloudcomputinghuthcebula.pdf
[5] R. Buyya, J. Broberg, and A. Goscinski (eds.), Cloud Computing: Principles and Paradigms, Wiley, 2011.
[6] IBM, http://www-01.ibm.com/software/data/bigdata/
[7] OpenStack Compute Starter Guide (Essex), May 4, 2012.
[8] http://hadoopinku.wordpress.com/category/hadoop-2/
[9] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
[10] C. Lam, Hadoop in Action, Manning (MEAP unedited draft), 2010.
[11] B. Lowagie, iText in Action, Manning, 2010.