A PERFORMANCE ANALYSIS OF HADOOP CLUSTERS IN OPENSTACK CLOUD AND IN REAL SYSTEM

Ramesh Maharjan and Manoj Shakya
Department of Computer Science and Engineering, Kathmandu University, Dhulikhel, Kavre, Nepal
lazymesh@gmail.com, manoj@ku.edu.np

Abstract

Cloud computing, data and distributed systems are three important aspects of this paper. Cloud computing is being embraced by every kind of organization and is being implemented in every field of work, be it business or education. Data storage and processing are fundamental tasks of any organization. Hadoop is a distributed framework created to handle big data processing. The aim of this paper is to study and analyze aspects such as performance, flexibility and scalability of Hadoop clusters running in the cloud and on commodity hardware.

Introduction

Cloud computing is an abstract term describing the use of resources that do not belong to the user to perform a required task, disconnecting from them once they are no longer needed. Buyya et al. [1] have defined it as follows: "A cloud is a parallel and distributed computing system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements (SLA) established through negotiation between the service provider and consumers." Vaquero et al. [2] have stated that "clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized Service Level Agreements."

Cloud computing [3] has early roots in the time when mainframe computers in academia and corporations were shared among users through client terminals. Cloud computing [5] builds on several computing research areas such as high-performance computing (HPC), virtualization, utility computing, grid computing, networking and security.

Depending on the type of service provided, cloud computing can be broadly divided into three models [4]: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). SaaS provides both resources and applications as a service to clients. PaaS goes a level beyond SaaS by giving clients access to the platforms on which they develop their own applications. IaaS provides storage and computing infrastructure to clients over the Internet.

Based on the deployment model, cloud computing can be subcategorized into public, private, community and hybrid clouds [4]. A public cloud is accessible to anyone with an Internet connection. A private cloud is limited to an organization or group. A community cloud is a private cloud shared between groups or organizations. A hybrid cloud is a mixture of at least two of the clouds explained above.

There are many platforms available for setting up a cloud. OpenStack was chosen for the following reasons. First, it is open source, leaving one free to pick and mix hardware, design one's own networks, use any virtualization technology and add whatever other features are needed. Second, it has the largest community of developers and contributors. Third, it is simple to configure and use.

Data is an inseparable entity of any organization. According to [6], "Every day, we create 2.5 quintillion bytes of data."
The obvious cause is the advancement of technology and its ever-wider use. The piled-up data needs to be stored, processed and analyzed to extract useful information. There are many technical solutions to this, but Hadoop was chosen for the following reasons. Hadoop makes data mining, analytics and processing of big data cheap and fast. It is an open-source project made to deal with terabytes of data in minutes, and it is how companies with gigantic amounts of data, such as Facebook, Twitter, Yahoo, eBay and Amazon, can make decisions cost-effectively and quickly. Hadoop is easily scalable, as hard drives and nodes can be added without shutting Hadoop down. Hadoop stores and processes any kind of data. It is natively written in Java but can be used from other languages, such as the SQL-inspired Hive language, C/C++, Python and many more.

Some knowledge of Hadoop is necessary to understand this paper. The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage, and it provides building blocks for distributed data storage, data analysis and coordination. Written in Java, it runs applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of MapReduce. Hadoop employs a master/slave architecture for both distributed storage and distributed computation.

This paper is divided into five parts: introduction, cloud, clusters, analysis and conclusion. The cloud section briefly describes the OpenStack cloud. The clusters section explains the clustering of the Hadoop system on real computers and in the cloud, and the running of applications on the distributed systems. The analysis section compares the two systems with appropriate graphs.

The cloud

According to the official OpenStack starter guide [7], "Cloud computing is a computing model, where resources such as computing power, storage, network and software are abstracted and provided as services on the Internet in a remotely accessible fashion." An infrastructure set up using the cloud computing model is generally referred to as the "cloud". The guide further explains that OpenStack is a collection of open-source software projects that enterprises and service providers can use to set up and run their cloud compute and storage infrastructure. The installation of the OpenStack cloud consists of five projects: Nova (compute), Swift, Glance, Keystone and Horizon, whose roles are illustrated in the following figures.

Fig. 1: Simple OpenStack architecture

Fig. 2: Simple OpenStack workflow
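In practice, the workflow of Fig. 2 (authenticate with Keystone, pick a Glance image, ask Nova to boot an instance) is driven through OpenStack's REST APIs, command-line tools or the Horizon dashboard. Purely as an illustrative sketch, the fragment below mirrors that workflow using the third-party OpenStack4j Java SDK; the SDK itself, the endpoint, the credentials and the image and flavor identifiers are all assumptions made for illustration, not the tooling used in this work.

import org.openstack4j.api.Builders;
import org.openstack4j.api.OSClient.OSClientV2;
import org.openstack4j.model.compute.Server;
import org.openstack4j.model.compute.ServerCreate;
import org.openstack4j.openstack.OSFactory;

public class BootInstance {
    public static void main(String[] args) {
        // Step 1: authenticate against Keystone (endpoint and credentials
        // are placeholders).
        OSClientV2 os = OSFactory.builderV2()
                .endpoint("http://controller:5000/v2.0")
                .credentials("admin", "secret")
                .tenantName("hadoop")
                .authenticate();

        // Step 2: describe the server to boot, naming a Glance image and a
        // Nova flavor (both IDs are placeholders).
        ServerCreate sc = Builders.server()
                .name("hadoop-slave4")
                .flavor("2")                    // e.g. the m1.small flavor
                .image("glance-image-uuid")
                .build();

        // Step 3: nova-compute boots the virtual server.
        Server server = os.compute().servers().boot(sc);
        System.out.println("Booted instance " + server.getId());
    }
}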
There are many reasons behind choosing OpenStack. The first is its flexibility regarding networking and hardware. The second is that it is open source, with a very large community of developers and contributors. Third, OpenStack has become a phenomenon: big companies such as Dell, AMD, Cisco and HP, alongside Rackspace, are using it, and Linux heavyweights such as Red Hat and Ubuntu are adopting it. Last but not least, it is simple and flexible to configure and use.

The clusters

After the successful installation and configuration of the OpenStack cloud, virtual servers were created in the cloud. For this experiment, four instances were created under the Hadoop project: one acting as master/slave and the other three as slaves. This four-node cluster was used to run and analyze the Hadoop jobs. The same configuration was applied to the personal computers as well, so the cluster in the cloud was identical to the cluster in the real systems (apart from the master's disk size, as the table shows). The details of the clusters are given in the table below, and a sketch of the Hadoop configuration that ties the nodes together follows Fig. 3.

Table 1: Details of the clusters

  Server (cloud vs real) | Personal computers                 | Virtual servers in cloud
  Master vs kucse-dcg    | 2 GB RAM, 2 VCPUs, 160 GB storage  | 2 GB RAM, 2 VCPUs, 80 GB storage
  Slave1 vs user         | 2 GB RAM, 2 VCPUs, 80 GB storage   | 2 GB RAM, 2 VCPUs, 80 GB storage
  Slave2 vs user1        | 2 GB RAM, 2 VCPUs, 80 GB storage   | 2 GB RAM, 2 VCPUs, 80 GB storage
  Slave3 vs user3        | 2 GB RAM, 2 VCPUs, 80 GB storage   | 2 GB RAM, 2 VCPUs, 80 GB storage

A screenshot of the dashboard of the cloud is given below.

Fig. 3: Dashboard of the Kathmandu University cloud with the Hadoop project
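For concreteness, the following is a minimal sketch of the Hadoop 1.x configuration that wires such a four-node cluster together; the hostnames master, slave1, slave2 and slave3 and the port numbers are illustrative assumptions rather than the exact values used in this experiment. Deploying identical copies of these files on every node is the usual way such a configuration is kept in sync across a cluster.

<!-- conf/core-site.xml: tells every node where the HDFS NameNode runs -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: tells every TaskTracker where the JobTracker runs -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

# conf/slaves: one DataNode/TaskTracker host per line; the master is listed
# too because it also acts as a slave in this setup
master
slave1
slave2
slave3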
After the successful configuration of the clusters, three jobs were run on both systems: two jobs converting image files to PDF files and one word-count job. The first two jobs were based on image and PDF files being serialized through the MapReduce framework [9][10][11] (a sketch of such a map step is given after the graphs), and the last job was taken from the Hadoop package itself. The results were largely as expected and are summarized in the tables below.

Table 2: Summary of the first job

             | Cluster in personal computers       | Cluster in virtual servers in cloud
  Inputs     | 23 folders, 94 image files, 169 MB  | 23 folders, 94 image files, 169 MB
  Outputs    | 23 folders, 94 PDF files, 90.1 MB   | 1 folder, 94 PDF files, 90.1 MB
  Time taken | 3 minutes and 8 seconds             | 1 minute and 31 seconds

Table 3: Summary of the second job

             | Cluster in personal computers       | Cluster in virtual servers in cloud
  Inputs     | 1 folder, 476 image files, 926 MB   | 1 folder, 476 image files, 926 MB
  Outputs    | 1 PDF file, 200.1 MB                | 1 PDF file, 200.1 MB
  Time taken | 7 minutes and 51 seconds            | 9 minutes and 22 seconds

Table 4: Summary of the third job

             | Cluster in personal computers       | Cluster in virtual servers in cloud
  Inputs     | 1 text file, 1.1 GB                 | 1 text file, 1.1 GB
  Outputs    | 1 text file with counts, 364.6 KB   | 1 text file with counts, 364.6 KB
  Time taken | 4 minutes and 0 seconds             | 5 minutes and 1 second

Three graphs generated from the above tables are given below.

Graph 1: Time taken for the first job
Graph 2: Time taken for the second job
Graph 3: Time taken for the third job
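The third job is the standard word-count example shipped with the Hadoop distribution, typically launched as hadoop jar hadoop-examples-*.jar wordcount <input> <output>. For the image-to-PDF jobs, the cited sources [9][10][11] suggest a map step along the following lines; this is a minimal hypothetical sketch, assuming a whole-file input format (such as the WholeFileInputFormat described in [9]) that delivers each image as a single (file name, raw bytes) record, and it is not the authors' actual code.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;

// Hypothetical map step: for every (file name, image bytes) record, emit the
// file name together with the bytes of a one-page PDF wrapping that image.
public class ImageToPdfMapper
        extends Mapper<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void map(Text fileName, BytesWritable imageBytes, Context context)
            throws IOException, InterruptedException {
        // BytesWritable's backing array may be longer than the valid data,
        // so copy out exactly getLength() bytes.
        byte[] raw = Arrays.copyOf(imageBytes.getBytes(), imageBytes.getLength());
        ByteArrayOutputStream pdfOut = new ByteArrayOutputStream();
        try {
            Document pdf = new Document();
            PdfWriter.getInstance(pdf, pdfOut); // iText [11] writes the PDF here
            pdf.open();
            pdf.add(Image.getInstance(raw));    // embed the image on one page
            pdf.close();
        } catch (DocumentException e) {
            throw new IOException("PDF conversion failed for " + fileName, e);
        }
        context.write(fileName, new BytesWritable(pdfOut.toByteArray()));
    }
}

For the second job, which merges 476 images into a single PDF, the per-image conversion would presumably be followed by a reduce step that concatenates the pages into one document.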
Analysis

The Hadoop distributed system set up on the personal computers was expected to be more efficient and faster than the cloud system. The obvious reasons are that the Hadoop framework was developed with commodity machines in mind and that the processing there is done on real hardware, without the resource sharing found in cloud systems.

The first job contradicted the points discussed above and the other two jobs. The reason is that this job has to read and write files recursively and therefore cache all the bytes read; because of hardware deficiencies in the personal computers used (especially memory), tasks had to be cancelled on the weaker machines (task trackers). The job was therefore somewhat slower on the real computers owing to these cancellations and the transfer of tasks to other task trackers (nodes).

To sum up, the personal-computer Hadoop cluster was as powerful and fault tolerant as the cloud Hadoop cluster, but not as scalable and flexible. The real systems consist of physical hardware that cannot easily be made available on demand, whereas the cloud cluster consists of virtualized systems that can be varied according to need and availability. Creating a new virtual server (node) in the cloud takes seconds, while adding a new node to the real system depends on the availability of computers. A problem in a virtual server can be solved by terminating it and creating a new one, which cannot be done in the real system. The real-system nodes were connected with network cables and so suffered a small but insignificant communication delay, while the cloud nodes lived inside the same project, which made communication between them much faster. The most important aspect of the cloud is that server creation and termination are extraordinarily easy, i.e. the cluster can be scaled up and down with great ease. Another aspect of the cloud is that the cluster can be made public without any difficulty; as the screenshot above shows, the master server's web interface is accessed through a browser on the client.

Conclusion

This paper analyzed running a Hadoop cluster in the cloud and in a real system, identifying the better solution by running simple Hadoop jobs in the configured clusters. It concludes that running a Hadoop cluster in the cloud for data storage and analysis is more flexible and more easily scalable than a real-system cluster. It also concludes that the cluster on real computers is generally faster than the cloud cluster. Nevertheless, thanks to the advantageous features of cloud computing, such as quick termination of problem nodes and their re-creation from the state in which the machine was terminated, automatic networking, and instant creation of nodes and clusters, the cloud Hadoop cluster is the more favourable option.

References

[1] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, 25:599-616, 2009.
[2] L. M. Vaquero, L. Rodero-Merino, J. Caceres, and M. Lindner, "A break in the clouds: Towards a cloud definition," SIGCOMM Computer Communication Review, 39:50-55, 2009.
[3] Cloud computing, Wikipedia, en.wikipedia.org/wiki/cloudcomputing
[4] A. Huth and J. Cebula, The Basics of Cloud Computing, US-CERT, http://www.us-cert.gov/sites/default/files/publications/cloudcomputinghuthcebula.pdf
[5] R. Buyya, J. Broberg, and A. Goscinski (eds.), Cloud Computing: Principles and Paradigms, Wiley, 2011.
[6] IBM, Big data, http://www-01.ibm.com/software/data/bigdata/
[7] OpenStack Compute Starter Guide (Essex), May 4, 2012.
[8] http://hadoopinku.wordpress.com/category/hadoop-2/
[9] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
[10] C. Lam, Hadoop in Action, Manning Publications (MEAP unedited draft), 2010.
[11] B. Lowagie, iText in Action, Manning Publications, 2010.