Cloud Computing based on the Hadoop Platform

Harshita Pandey
UG, Department of Information Technology, RKGITW, Ghaziabad

ABSTRACT

In recent years, cloud computing has emerged as the new IT paradigm. It is a pay-per-use model for enabling convenient, on-demand network access to a shared pool of configurable computing resources. Hadoop handles massive data with the help of parallel processing. Developed by the Apache Software Foundation, Hadoop is widely applied as the most popular distributed platform. This paper focuses on the Hadoop platform computing model and the Map/Reduce algorithm. We combine K-means with data mining technology to analyze the effectiveness and application of the cloud computing platform.

Key words: cloud computing; Hadoop; Map/Reduce; data mining; K-means

I. INTRODUCTION

With the maturing of computer network technology, cloud computing has been widely recognized and applied [4]. IT giants such as Google, IBM, Amazon and Microsoft have launched their own commercial cloud products and have made cloud computing a priority in their future development strategies. But this leads to a large-scale data problem: the explosive growth of online information, in which each user may generate a huge amount of data. Meanwhile, transistor circuits have been gradually approaching their physical limits, and Moore's law, under which CPU performance doubled every 18 months, is reaching its failure point. Facing such massive information, how to manage and store the data is an important issue we must deal with.
Hadoop is a platform for building cloud computing, designed as an Apache open source project. We use this framework to solve these problems and manage data conveniently. It rests on two major technologies: HDFS, which achieves the storage and fault tolerance of huge files, and Map/Reduce, which computes the data by distributed computing.

II. BACKGROUND KNOWLEDGE AND RELEVANT CONCEPTS

Cloud computing developed from a variety of network technologies, including parallel computing, distributed computing and grid computing. It carries out all tasks virtually in the cloud, combining many cheap computing nodes into one huge system that provides large computing capacity. In other words, it can be regarded as separating the monitor from the main engine, as Figure 1 shows.

Figure 1: Cloud Computing Structure

The cloud works like a bank system: we can store data and use applications as conveniently as we save and manage money in a bank. The user no longer needs much hardware as background support; the only requirement is a connection to the cloud.

A Parallel Computing

Parallel computing is a method to raise computing efficiency by solving a problem with multiple resources. The main principle is to split a task into N parts and send them to N computers, so efficiency is increased up to N times. But parallel computing has a serious shortcoming: the parts are interdependent, which is a barrier to its development.
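The split-a-task-into-N-parts idea above can be sketched with Python's standard multiprocessing module. The task chosen here (summing a large list) and N = 4 are illustrative assumptions, not something the paper specifies; note that the parts are independent, which is the easy case.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker computes its part independently of the others.
    return sum(chunk)

def parallel_sum(data, n_parts=4):
    # Split the task into N roughly equal parts.
    size = (len(data) + n_parts - 1) // n_parts
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Send each part to one of N worker processes, then merge the results.
    with Pool(n_parts) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))  # matches the serial sum
```

The speedup is at most N, and only when the parts really are independent; interdependent parts force communication between workers, which is exactly the shortcoming noted above.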
B Distributed Computing

The basic principles of distributed computing and parallel computing are consistent. The advantage of distributed computing is its strong fault tolerance and the ease of expanding computing capacity by increasing the number of computer nodes. The difference is that the parts are independent of each other, so the failure of a batch of computing nodes does not affect the accuracy of the calculation.

III. THE STRUCTURE OF CLOUD COMPUTING

A cloud computing platform is a powerful cloud network that connects a large number of concurrent services and can be extended across virtual servers. Combining every resource through the platform supports huge computing and storage abilities. The general cloud computing system structure is shown in Figure 2.

Figure 2: Cloud Computing Platform

A Hadoop Structure

Hadoop, developed by the Apache Software Foundation, is a distributed system infrastructure [3]. It lets users program distributed software easily even if they know nothing about the underlying layers. HDFS, the base layer, is the main storage system in Hadoop and runs on ordinary cluster components. It is usually deployed on low-cost hardware devices to provide a high rate of transmission and access to application data, which suits programs with large datasets.

B Map/Reduce

Map/Reduce, presented by Jeffrey Dean and Sanjay Ghemawat, is a programming model for massive data computing [1]. It was developed by Google and is a core technology of cloud
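The two steps of the model can be sketched as an in-memory word count, the customary Map/Reduce illustration. This is only a single-machine sketch: the grouping between the two phases, which the Hadoop framework performs for you as the shuffle, is simulated here with a dictionary, and the input documents are assumed.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word.
    return [(word, 1) for word in document.split()]

def reduce_phase(key, values):
    # Reduce: combine all values that share the same key.
    return key, sum(values)

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key (done by Hadoop itself in practice).
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = map_reduce(["big data", "big cloud data"])
# counts == {"big": 2, "data": 2, "cloud": 1}
```

Because every Map call and every Reduce call is independent, the framework can spread them over many nodes, which is what gives the model its scalability.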
computing. The model abstracts the common operations on large datasets as Map and Reduce steps, which reduces the programmer's difficulty with distributed and parallel computing.

Figure 3: Map/Reduce Process

IV. APPLICATION OF CLUSTERING

Clustering algorithms are an important part of data mining, especially in a system computing data at this scale, and clustering is particularly vital in cloud technology. Dividing data by its characteristics is the most vital step in the storage and security of cloud computing. There are many algorithms in cluster analysis; here we combine K-means and Map/Reduce to discuss the distribution of data on the Hadoop platform.

K-means Algorithm

The K-means algorithm optimizes an objective function based on the distance between each data point and its cluster center, using Euclidean distance as the similarity measure [3]. The resulting clustering must satisfy two rules: objects in the same cluster are highly similar, while objects in different clusters have low similarity.

V. CONCLUSION

After studying cloud computing based on the Hadoop platform thoroughly, I conclude that data storage is an important element of cloud computing. This paper discussed the major core technologies in the Hadoop framework, HDFS and Map/Reduce. Combining data mining with the K-means clustering algorithm makes data management easier and quicker in the cloud computing model.
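As a concrete illustration of the K-means procedure summarized above, the following is a minimal single-machine sketch using Euclidean distance. The sample points, k, and the iteration count are assumptions for illustration only; the paper gives no implementation, and on Hadoop the assignment step would typically become a Map and the centroid update a Reduce.

```python
import random

def euclidean(a, b):
    # Euclidean distance, the similarity measure used by K-means.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centers, clusters

points = [(0.0, 0.0), (0.1, 0.2), (10.0, 10.0), (10.2, 9.9)]
centers, clusters = kmeans(points, k=2)
```

Iterating the two steps drives down the total point-to-center distance, so similarity within a cluster rises while similarity between clusters falls, matching the rules stated in Section IV.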
Even though this technology is still in its infancy, we believe that with continuous improvement, cloud computing will develop in a secure and reliable direction.

REFERENCES

[1] Wang Xiangqian. Optimization of a High Performance MapReduce System. Computer Software and Theory, University of Science and Technology of China, 2010.
[2] Zhu Zhu. Research and Application of a Massive Data Processing Model Based on Hadoop. Beijing University of Posts and Telecommunications, 2008.
[3] Yang Chenzhu. The Research of Data Mining Based on Hadoop. Chongqing University, 2010.
Qiu Rongtai. Research on MapReduce Application Based on Hadoop. Henan Polytechnic University, 2009.
[4] Wang Peng. Into Cloud Computing. People's Posts and Telecommunications Press, 2009 (In Chinese).