© 2014 SEMAR GROUPS TECHNICAL SOCIETY.


ISSN Vol.03,Issue.11 June-2014, Pages: Enhance the Performance of Cloud Computing with Hadoop Dept of CSE, College of Science, Kirkuk University, Ministry of Higher Education & Scientific Research, Iraq. Abstract: Cloud computing started as an option and is slowly becoming a necessity. The cloud provides quick solutions and can be considered cheap compared with alternative solutions, but like anything else the cloud has disadvantages that need to be removed in order to enhance the cloud environment. In the cloud, and especially in IaaS services, the goal is to pay less and gain more; and not only in IaaS: keeping the client's data safe can be the client's first requirement for trusting the cloud, among many others. On the other side there is a technology called Hadoop, which can be considered new in the world of cloud computing. Hadoop relies on a smart strategy: it uses cheap hardware requirements but provides much more, offering fast processing of data despite the cheap environment, more storage than the provided hardware alone would suggest, a technique for protecting data from loss, and more. Hadoop has proved its success with many successful web services such as Twitter, Facebook, etc. In the previous paper, the performance of analyzing data with the help of both Hadoop and the cloud was examined using the basic tools of Hadoop; the results were impressive, showing 86% less time for processed files of big sizes. In this paper another strategy for enhancing the performance of both Hadoop and cloud computing is used; compared with the previous paper, a 98% time improvement for processed files of big sizes was obtained. Keywords: Cloud Computing; Hadoop; Java. I.
INTRODUCTION According to the "National Institute of Standards and Technology" (NIST) in 2013 [10], cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Cloud computing offers three major types of services. SaaS is the model in which an application is hosted as a service for customers who access it via the Internet. PaaS is the second type of cloud computing service; here the service provider is responsible for delivering another type of application, the software resources: the PaaS vendor supplies all the resources required to build applications and services completely via the Internet, without having to download or install software. IaaS simply offers the hardware so that an organization can use it; rather than purchasing servers and racks, the data center can simply be used without building or buying it, because the service provider rents those resources out, and IaaS can be dynamically scaled up or down based on the application's resource needs [8]. Hadoop is an open-source framework that deals with distributed systems and has the ability to analyze and process big data, i.e. files of huge size (terabytes and more). Hadoop consists of two major components: the first is the set of servers responsible for storage, and the second is the technique used to connect the servers inside a Hadoop cluster.
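Both components are detailed below; the overall map-then-reduce flow they implement can be illustrated with a minimal plain-Java simulation (a sketch only; real Hadoop jobs use the org.apache.hadoop API, which is not shown here):

```java
import java.util.*;

// Minimal plain-Java illustration of the map-then-reduce idea:
// map emits (word, 1) pairs, reduce sums the values per key.
class MapReduceSketch {
    // Map phase: emit one (word, 1) pair for every word in the input.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: group the pairs by key and sum their values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // For "cloud hadoop cloud" this prints a map with cloud=2, hadoop=1.
        System.out.println(reduce(map("cloud hadoop cloud")));
    }
}
```

In a real cluster the map and reduce calls run on different TaskTracker nodes and the grouping happens in the shuffle stage; this sketch collapses all of that onto one machine.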
The first part is called HDFS. HDFS is distinguished from other distributed file systems such as NFS: NFS provides one machine to store the clients' files by making that portion visible to them, so if there is a big load on this server from the clients, the server can crash, and the other distributed systems have similar problems. HDFS has the ability to solve the problems of the other distributed systems. HDFS divides the storage role among three major components: the NameNode, the DataNode, and the Secondary NameNode. The NameNode works as an index: first of all, the NameNode receives from the clients the incoming files that need to be stored for later processing, divides the data, and then distributes the partitioned data to the other servers in the Hadoop cluster. The DataNode is the server that stores the divided data coming from the NameNode. Hadoop also provides a backup for the NameNode in case any kind of failure happens: the Secondary NameNode, which works as a backup for the NameNode server. The second component is called MapReduce, the heart of Hadoop. MapReduce is a programming framework, originally created by Google but later developed by Apache Hadoop, consisting of a JobTracker and TaskTrackers: the client contacts the JobTracker and sends a request for processing the data, the JobTracker sends the request to one of the TaskTrackers, and then the processing starts.

II. RELATED WORKS
[1] The researchers in this work merged Hadoop and cloud computing and implemented secure access to the cloud using fingerprint identification and face identification. The Hadoop cloud used in this work was connected to mobile and thin clients over wired or wireless networks, and the master server was connected to the slave servers through the Hadoop cloud. The implemented cloud produced the SaaS, PaaS, and IaaS services; the server operating system was a Linux platform connected via Ethernet, Wi-Fi, or 3G to reach the clients, and the cloud was built using the (JamVM) virtual machine as a platform for J2ME, which served as the Java platform for mobile clients. The results showed that fingerprint identification and face identification were processed within only 2.2 seconds to identify a person. [2] The researchers in this work studied and showed that the performance of VMs can be increased with load balancing across all the VMs in the cloud; they implemented Hadoop among these VMs to get the load-balancing feature. The example cloud they used was the Eucalyptus cloud, and the system they proposed was the EuQoS system. They also tested the proposed system on real-time data, then compared their proposed system with the normal Eucalyptus cloud with Hadoop, and found that their proposed system improved performance by 6.94% compared with Hadoop.
The proposed EuQoS system consists of HDFS, MapReduce, HBase, and load balancing. The MapReduce task is responsible for mapping the jobs, and HDFS is responsible for saving the status of the mapping and producing it to the reduce task. For HDFS they inherited the basic idea of HDFS and created a DFS that contains a namenode, responsible for opening, naming, and closing the files, and a datanode, responsible for the read and write requests and for dealing with the actual files stored in it. In the HBase phase, the EuQoS system is merged with the Eucalyptus cloud: the HMaster server assigns the work to the HRegions, and if an HRegion fails to do the work, the HMaster reassigns the work to another HRegion; the HRegion server is responsible for the reads, the writes, and the requests that come from the HMaster. The last component of the EuQoS system is load balancing, which consists of two components, a load balancer and an agent-based monitor; the load balancer is responsible for three functions: balance triggering, EuQoS scheduling, and VM controlling. As a result, the researchers used IPV for the connection between the VMs and found that the performance of the Eucalyptus cloud with Hadoop was less than that of the EuQoS system.
[3] The researchers in this work compared the PaaS offerings used in the cloud (Hadoop, Dryad, HPCC), and, using cloud computing and Hadoop, also proposed an application handling one of the most complex kinds of data: vision computing. They created a Rhizome cloud for detection and vision computing with the help of Hadoop, so that the data captured from video could be analyzed quickly. They used a representation as input for the cloud; the nodes used in the Hadoop cluster were treated as grains, and the scheduling algorithm used among the nodes was FIFO. Each grain worked independently from the other grains, because Hadoop offers this feature, which increases the ability to protect the whole cloud and ensures that a failure in one of the grains will not affect the other grains. The grains work under a grain manager (the master node), which takes responsibility for watching the work of the other grains; these grains are responsible for vision computing using video capture, motion detection, or even matrix multiplication. The performance of Hadoop with JNI and Rhizome was tested against another application using Hadoop with Rhizome; the input files were two hours of video. JNI with Rhizome took around 80 minutes for the frames, while Hadoop with Rhizome alone took around (31) minutes for the same input frames. In another comparison, using matrix multiplication, high-quality video (1920 x 1080) took 8 M/Sec, while using Hadoop for analyzing the surveillance cameras improved the speed to only a few milliseconds.
[4] In the previous paper an investigation was carried out to improve the performance of Hadoop and cloud computing: the performance of the cloud with the basic tools of Hadoop showed an improvement of 86%. In this paper another investigation is made based on the previous paper, with another scenario built according to the needs of the client: some clients need to increase the tools used, others need to decrease them. To make this investigation fair, the same program was used, but built differently than in the previous scenario. The first part of this paper provides rich material on cloud computing and Hadoop, to understand the characteristics, architecture, and working nature of both cloud computing and Hadoop separately. The second part lists a group of previous works done in recent years and studies them in some detail. The third part lists the objectives of this paper. The fourth part aims to set up a tiny virtual cloud, then set up a single-cluster Hadoop installation on one machine, and build two different cases to investigate the behavior of cloud computing and Hadoop. The fifth part presents the results of processing 72 files of different sizes and contents under the two cases built in the fourth part, followed by a discussion of the results to determine whether there is an improvement from merging cloud computing with Hadoop. The sixth part presents the final conclusion.

III. OBJECTIVE The objective of this paper is to build a tiny virtual cloud and merge Hadoop with it, then build two different kinds of cases, one including the enhanced Hadoop and the cloud together, the other being the cloud by itself without Hadoop, then test the performance of Hadoop and the cloud and compare the obtained results of the enhanced Hadoop with the cloud on one side against the results of the cloud without Hadoop on the other, to see whether performance can be increased or not.

IV. METHODOLOGY First of all, a tiny cloud of normal resource requirements was established and a single-cluster Hadoop installation was made; this cloud includes all the components of Hadoop in the same cloud. In this cloud a virtual data center was built, meaning the master server and the slave server reside in the same cloud. The administrator of the cloud can monitor the performance of the running servers in the virtual data center using two window interfaces that provide complete information about the virtual data center. The health of HDFS and its servers can be monitored using a special site: the administrator can connect to HDFS on port number (50070). The performance of MapReduce can also be monitored, along with every request coming from a client, how the request is served, and how much time each request takes to finish; this job can be accomplished on port number (50030). Figure1. The HDFS of Hadoop. Figure2. The MapReduce of Hadoop. Since Hadoop is built with the Java language, and the second case of the cloud uses only Java, the Java programming language was also installed. The platform used to work with Java, and to connect Java with Hadoop, was Eclipse, one of the platforms used to run the Java language; what makes this platform special is that it is characterized by simplicity and a friendly interface. Two different kinds of scenarios were built to test the performance of Hadoop and cloud computing: the first case tests the performance of Hadoop with cloud computing, and the second case tests the performance of cloud computing by itself, without Hadoop. The performance of these two cases focuses on the ability to process big data, i.e. files of big sizes.
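As a small illustration of the two monitoring endpoints mentioned above, the web UIs can be addressed as follows (a hedged sketch: the hostname is an assumption, while 50070 and 50030 are the default NameNode and JobTracker web ports in Hadoop 1.x):

```java
import java.net.URI;

// Builds the addresses of the two Hadoop monitoring web UIs described in
// the text: HDFS health on port 50070, MapReduce/JobTracker on port 50030.
class MonitorUrls {
    // HDFS (NameNode) health page.
    static URI hdfsUi(String host) {
        return URI.create("http://" + host + ":50070/");
    }

    // MapReduce (JobTracker) status page.
    static URI mapReduceUi(String host) {
        return URI.create("http://" + host + ":50030/");
    }

    public static void main(String[] args) {
        // "localhost" is illustrative; in the single-cluster setup of this
        // paper all Hadoop components run on the same machine.
        System.out.println(hdfsUi("localhost"));
        System.out.println(mapReduceUi("localhost"));
    }
}
```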
A. Case (1) In this case another two different MapReduce programs were created, to test the performance of Hadoop using more tools in one of the programs; the tools used in this case were not used in the previous paper, while in the other program the tools used in the previous paper were reduced. This case was carried out in two ways. The first is building a MapReduce program to test the performance of cloud computing and Hadoop in processing files of word contents. The second is building a MapReduce program to process big files of numeric contents only. The MapReduce programs used in this case are: A- Search Program. B- Maximum Program.

1. Search Program The purpose of this program is to search for a specific word and print how many times it is repeated in a specific file; the entries to this program must be of alphabetic contents only. To ensure that files are processed with this program in an effective way, first of all a particular job is created for this program; the job retrieves the files to be processed from the server where they already reside, and processes them in parts:

a. The Search Algorithm:
  Mapper class
    name = the word to find
    Map function (key, value)
      i = 0
      while there are more words in the file do
        curr = next word in the file
        if (curr = name)
          i = i + 1
      if (i > 0)
        print the name found and the number of appearances

b. Partition Part: in this part, the job requests from the client the word he/she wants to find, then the job starts creating keys and values: the job breaks the contents of the file, taking each word and assigning a key to each word in the file, so each broken word becomes a value and each value takes a particular numeric key. After partitioning the content of the file, the job takes the requested word as a search key and starts reading the intermediate words one by one; whenever the word the client wants to find is matched, the job counts it, until reaching the end of the intermediate outputs. Then the found word and the number of times it was found are printed to a file in the directory on the server that the client requested.

2. Maximum Program This program processes files to extract the maximum number from the whole content of a file and prints the result in a file on the server; the file type this program uses is numeric files only. To ensure that files are processed with this program in an effective way, when the client requests the maximum number of a numeric file, a job is created that asks the client to locate on the server the file whose maximum number is needed, and processes it in parts:

a. The Maximum Algorithm:
  Mapper class
    Map function (key, value)
      while there are more values in the file do
        produce (key, value)
  Combiner class
    Combine function (key, values)
      max = smallest value in the file
      for each value d in the file do
        if (d > max)
          max = d
      produce (key, max)
  Reducer class
    Reduce function (key, values)
      max = smallest value in the file
      for each value d in the file do
        if (d > max)
          max = d
      print the maximum number

b. Partition Part: in this part, the job breaks the contents of the file into a set of values, each taking a numeric key; the numbers are treated as strings and come under the attribute value, and each value has a unique key. The job produces the output set of values with their keys and prepares it for the next part, finding the maximum number.

c. Finding the Maximum Number, Part 1: the job takes the intermediate outputs from the partition part and makes them inputs to this part; in this part it will not work on the same original file, but on the set of values and their keys. The job starts here by finding the minimum number among the intermediate outputs of the previous part, then takes this minimum number as the current maximum and compares all the intermediate values against it; whenever a number greater than it is found, that number is taken as the new maximum. The job thus produces new intermediate values as output, where each found maximum number is produced with its key; as a result the job produces a new set of found maximum numbers with their keys.

d. Finding the Maximum Number, Part 2: the job takes the intermediate outputs from Part 1, where the maximum numbers have mostly been found, but not as a final result, only to reduce the work of the job in the final part. In this part the job again follows the same procedure of finding the maximum number: it searches for the minimum number among the set of found maximum numbers, takes this minimum number as the current maximum, and compares all the intermediate values against it, updating the maximum whenever a greater number is found, until reaching the end of the intermediate numbers. Then the job prints the maximum number in a file and writes it to the directory specified by the client.
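The core logic of the two Case (1) programs described above can be sketched in plain Java as follows (an illustration only; in the paper both are implemented as Hadoop MapReduce jobs, whose API is not reproduced here):

```java
// Plain-Java sketch of the two Case (1) program logics:
// the Search Program counts occurrences of a word, and the Maximum
// Program keeps the largest number seen, mirroring the combine/reduce
// passes described in the algorithm above.
class CaseOneSketch {
    // Search Program logic: count how many times `name` appears in the text.
    static int countOccurrences(String text, String name) {
        int count = 0;
        for (String word : text.split("\\s+")) {
            if (word.equals(name)) count++;
        }
        return count;
    }

    // Maximum Program logic: start from the smallest possible value and
    // replace it whenever a larger number is found.
    static long maximum(long[] values) {
        long max = Long.MIN_VALUE;
        for (long v : values) {
            if (v > max) max = v;
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("cloud hadoop cloud", "cloud")); // 2
        System.out.println(maximum(new long[]{30, 400, 210}));               // 400
    }
}
```

In the Hadoop version this same logic is split across the Mapper, Combiner, and Reducer so that each TaskTracker handles only its partition of the file.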
B. Case (2) In this case the test is based on testing the behavior of cloud computing without using Hadoop, and to keep the test on a consistent basis, the programs used in this testing have the same names but are of course completely different from the programs used in the previous case. Since the tested behavior is based on processing files of numbers as well as files of characters, two programs were created in this case: one of the two programs processes files of different sizes containing only numbers, and the second program processes alphabetic files of different sizes. The programs are: A- Search Program. B- Maximum Program.

1. Search Program The purpose of this program is to search for a specific word and print how many times it is repeated in a specific file; the entries to this program must be of alphabetic contents only. Using normal Java, an ordinary program was created that searches the complete file, finds the particular word, and counts how many times the word is repeated.

2. Maximum Program This program processes files to extract the maximum number from the whole content of a file and prints the result in a file on the server; this program works only with numeric file contents. Using normal Java, an ordinary program was created that finds the maximum number in the complete file: it finds the minimum number in the file, takes this number as the current maximum, compares it with all the content of the file, and then prints out the maximum number found.

V. RESULTS According to the two cases created in the cloud computing environment, and to make the simulation as near as possible to reality, different files of txt type were created in this paper. To test the virtual cloud with and without Hadoop, and the cloud's ability to process the two cases, files of different sizes were used, to make the analysis of cloud computing and Hadoop as fair as possible. Twelve files of different sizes containing only numbers were used, with sizes of (30, 60, 90, 120, 150, 180, 200, 210, 240, 270, 300, 400) MB, and another twelve files of different sizes containing only words, with the same sizes of (30, 60, 90, 120, 150, 180, 200, 210, 240, 270, 300, 400) MB. A. Case (1): Using 24 txt files in total, two different programs were tested in this case under different circumstances; the results obtained from this first case, using Hadoop with the cloud, are shown in Figure3. Figure3: The Case (1) Results. B. Case (2): In the second and last case, using only normal Java without Hadoop, and using 24 txt files in total, the results are shown in Figure4. Figure4: The Case (2) Results.

VI. DISCUSSION A. Maximum Program The behavior of this program was tested using Case (1), in the virtual cloud with the tools of Hadoop, and Case (2), in the virtual cloud without using Hadoop. The processing times in these two cases, using the same txt files with the same sizes, are shown in the table below.

TABLE1. MAXIMUM PROGRAM COMPARISON
Size  Case(1)  Case(2)  Improvement %

For the Maximum program, which gets the maximum number in a file, for the smallest file of size 30 MB the elapsed processing time for the whole file was (21.604) seconds, and for the largest file of size 400 MB it was ( ) seconds; that means for the 30 MB file the elapsed time was 36 seconds, and at most, for the 400 MB file, it was 3 minutes and 57 seconds. The results were taken under different circumstances. According to the table, Case (1), tested in the virtual cloud using more tools in Hadoop, proved a better improvement compared with Case (2), the same program running in the virtual cloud without Hadoop; the improvement ranged between ( %) as a minimum and ( %) as a maximum, so the total average improvement was ( %).

B. Search Program: This program was tested with the two cases: in Case (1) it was tested in the virtual cloud with Hadoop, using only the tools necessary for executing this specific program, and in Case (2) it was tested in the virtual cloud without using Hadoop. The processing times the program took with the 12 txt files in these two cases are:

TABLE2. SEARCH PROGRAM COMPARISON
Size  Case(1)  Case(2)  Improvement %

The search program searched only for a specific word chosen randomly. For the smallest file size of 30 MB, the elapsed time for processing the complete file was (13.348) seconds, and for the largest file size the elapsed time for processing the complete file was (76.17) seconds; so the processing time for the 30 MB file took only 22 seconds, and for the largest file of 400 MB the elapsed time was one minute and 26 seconds. These results were taken under different circumstances. According to the table, the behavior of the search program in Case (1) showed a better improvement than in Case (2); the improvement of Case (1) ranged between ( %) as a minimum and ( %) as a maximum, and the total average improvement of Case (1) compared with Case (2) was ( %). VII.
CONCLUSION In this paper the behavior of the cloud with Hadoop was tested. In the previous paper the improvement result was 86%, so in this paper the tools of Hadoop used were enhanced based on the situation of each program; each case and each file was processed under different circumstances, to test the ability of the cloud to deal with the different requests coming from different clients, with the normal hardware requirements shown in the table below. The cloud and Hadoop in Case (1) proved their ability to take the different requests coming from the clients and serve these requests in an optimum time, without affecting the health of the machine that holds the whole cluster; the maximum time taken by the cloud with Hadoop was approximately 3 minutes and 57 seconds, with files of 400 MB.

TABLE3. THE HARDWARE AND SOFTWARE REQUIREMENTS
1. Hard Disk: 80 GB SCSI
2. Memory: 2 GB
3. Processor: Intel(R) Core M
4. CPU: 2.40 GHz
5. Operating system: CentOS final, 64-bit

The behavior of the cloud alone, without Hadoop, was also tested, dealing with different requests coming from different clients. It was shown that serving each request from the clients made the machine act very slowly, and the processing time increased with the increase in file size, which means the cloud behaves very badly as the number of incoming client requests increases and as the file sizes increase.

VIII. REFERENCES
[1] L. Li and M. Zhang (2011), The Strategy of Mining Association Rule Based on Cloud Computing, International Conference on Business Computing and Global Informatization, pp.
[2] J. Chen, Y. Larosa, P. Yang (2012), Optimal QoS Load Balancing Mechanism for Virtual Machines Scheduling in Eucalyptus Cloud Computing Platform, in the 2nd Baltic Congress on Future Internet Communications, pp.
[3] L. Li and M. Zhang (2013), Rhizome: A Middle-Ware For Cloud Vision Computing Framework, ICSSC IEEE, pp.
[4] D. Raad, S. Singh, A Comparative Analysis Of The Performance Of Cloud Computing With Java And Hadoop, International Journal of Computer Science Engineering and Information Technology Research, pp.
[5] Cloud Computing: A Practical Approach by Anthony T. Velte, Toby J. Velte and Robert Elsenpeter (2010).
[6] Hadoop For Dummies, Special Edition by Robert D. Schneider (2012).
[7] Hadoop in Practice by Alex Holmes (2012).
[8] Hadoop: The Definitive Guide, 3rd edition, by Tom White (2012).
[9] Hadoop in Action by Chuck Lam (2011).
[10] NIST (National Institute of Standards and Technology), The NIST Cloud Computing Standards Roadmap, Special Publication, by Lee Badger, Tim Grance, Robert Patt-Corner and Jeff Voas (2012).

[11] Welcome to Apache Hadoop, apache.org/.
[12] Download Eclipse for Linux, CentOS.
[13] Download Java JDK for CentOS, technetwork/java/javase/downloads/jre7-downloads html.
[14] Download VMPlayer, vmware/downloads.
[15] Download CentOS, centos/6/isos/x86_64/.
[16] Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, Electron spectroscopy studies on magneto-optical media and plastic substrate interface, IEEE Transl. J. Magn. Japan, vol. 2, pp., August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].
[17] M. Young, The Technical Writer's Handbook. Mill Valley, CA: University Science, 1989.


Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Hadoop Distributed File System (HDFS) Overview

Hadoop Distributed File System (HDFS) Overview 2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT 1 SEUNGHO HAN, 2 MYOUNGJIN KIM, 3 YUN CUI, 4 SEUNGHYUN SEO, 5 SEUNGBUM SEO, 6 HANKU LEE 1,2,3,4,5 Department

More information

Cloud Federation to Elastically Increase MapReduce Processing Resources

Cloud Federation to Elastically Increase MapReduce Processing Resources Cloud Federation to Elastically Increase MapReduce Processing Resources A.Panarello, A.Celesti, M. Villari, M. Fazio and A. Puliafito {apanarello,acelesti, mfazio, mvillari, apuliafito}@unime.it DICIEAMA,

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM Ramesh Maharjan and Manoj Shakya Department of Computer Science and Engineering Dhulikhel, Kavre, Nepal lazymesh@gmail.com,

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

Student's Awareness of Cloud Computing: Case Study Faculty of Engineering at Aden University, Yemen

Student's Awareness of Cloud Computing: Case Study Faculty of Engineering at Aden University, Yemen Student's Awareness of Cloud Computing: Case Study Faculty of Engineering at Aden University, Yemen Samah Sadeq Ahmed Bagish Department of Information Technology, Faculty of Engineering, Aden University,

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Research Article Hadoop-Based Distributed Sensor Node Management System

Research Article Hadoop-Based Distributed Sensor Node Management System Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Design of Electric Energy Acquisition System on Hadoop

Design of Electric Energy Acquisition System on Hadoop , pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University

More information

Processing of Hadoop using Highly Available NameNode

Processing of Hadoop using Highly Available NameNode Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Performance Analysis of Book Recommendation System on Hadoop Platform

Performance Analysis of Book Recommendation System on Hadoop Platform Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2. EDUREKA Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.0 Cluster edureka! 11/12/2013 A guide to Install and Configure

More information

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding

More information

Load Balancing in Cloud Computing using Observer's Algorithm with Dynamic Weight Table

Load Balancing in Cloud Computing using Observer's Algorithm with Dynamic Weight Table Load Balancing in Cloud Computing using Observer's Algorithm with Dynamic Weight Table Anjali Singh M. Tech Scholar (CSE) SKIT Jaipur, 27.anjali01@gmail.com Mahender Kumar Beniwal Reader (CSE & IT), SKIT

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Telecom Data processing and analysis based on Hadoop

Telecom Data processing and analysis based on Hadoop COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

HADOOP MOCK TEST HADOOP MOCK TEST II

HADOOP MOCK TEST HADOOP MOCK TEST II http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Reallocation and Allocation of Virtual Machines in Cloud Computing Manan D. Shah a, *, Harshad B. Prajapati b

Reallocation and Allocation of Virtual Machines in Cloud Computing Manan D. Shah a, *, Harshad B. Prajapati b Proceedings of International Conference on Emerging Research in Computing, Information, Communication and Applications (ERCICA-14) Reallocation and Allocation of Virtual Machines in Cloud Computing Manan

More information

Analysis of Information Management and Scheduling Technology in Hadoop

Analysis of Information Management and Scheduling Technology in Hadoop Analysis of Information Management and Scheduling Technology in Hadoop Ma Weihua, Zhang Hong, Li Qianmu, Xia Bin School of Computer Science and Technology Nanjing University of Science and Engineering

More information

Sriram Krishnan, Ph.D. sriram@sdsc.edu

Sriram Krishnan, Ph.D. sriram@sdsc.edu Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference

More information

Cloud Computing Simulation Using CloudSim

Cloud Computing Simulation Using CloudSim Cloud Computing Simulation Using CloudSim Ranjan Kumar #1, G.Sahoo *2 # Assistant Professor, Computer Science & Engineering, Ranchi University, India Professor & Head, Information Technology, Birla Institute

More information

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Qloud Demonstration 15 319, spring 2010 3 rd Lecture, Jan 19 th Suhail Rehman Time to check out the Qloud! Enough Talk! Time for some Action! Finally you can have your own

More information

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm (

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm ( Apache Hadoop 1.0 High Availability Solution on VMware vsphere TM Reference Architecture TECHNICAL WHITE PAPER v 1.0 June 2012 Table of Contents Executive Summary... 3 Introduction... 3 Terminology...

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

CURSO: ADMINISTRADOR PARA APACHE HADOOP

CURSO: ADMINISTRADOR PARA APACHE HADOOP CURSO: ADMINISTRADOR PARA APACHE HADOOP TEST DE EJEMPLO DEL EXÁMEN DE CERTIFICACIÓN www.formacionhadoop.com 1 Question: 1 A developer has submitted a long running MapReduce job with wrong data sets. You

More information

Assignment # 1 (Cloud Computing Security)

Assignment # 1 (Cloud Computing Security) Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ. Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

BBM467 Data Intensive ApplicaAons

BBM467 Data Intensive ApplicaAons Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal akal@hace7epe.edu.tr Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

EFFICIENT JOB SCHEDULING OF VIRTUAL MACHINES IN CLOUD COMPUTING

EFFICIENT JOB SCHEDULING OF VIRTUAL MACHINES IN CLOUD COMPUTING EFFICIENT JOB SCHEDULING OF VIRTUAL MACHINES IN CLOUD COMPUTING Ranjana Saini 1, Indu 2 M.Tech Scholar, JCDM College of Engineering, CSE Department,Sirsa 1 Assistant Prof., CSE Department, JCDM College

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

NTT DOCOMO Technical Journal. Large-Scale Data Processing Infrastructure for Mobile Spatial Statistics

NTT DOCOMO Technical Journal. Large-Scale Data Processing Infrastructure for Mobile Spatial Statistics Large-scale Distributed Data Processing Big Data Service Platform Mobile Spatial Statistics Supporting Development of Society and Industry Population Estimation Using Mobile Network Statistical Data and

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

A Cost-Evaluation of MapReduce Applications in the Cloud

A Cost-Evaluation of MapReduce Applications in the Cloud 1/23 A Cost-Evaluation of MapReduce Applications in the Cloud Diana Moise, Alexandra Carpen-Amarie Gabriel Antoniu, Luc Bougé KerData team 2/23 1 MapReduce applications - case study 2 3 4 5 3/23 MapReduce

More information