Cloud Computing based on the Hadoop Platform



Harshita Pandey
UG, Department of Information Technology, RKGITW, Ghaziabad

ABSTRACT

In recent years, cloud computing has emerged as the new IT paradigm: a pay-per-use model providing convenient, on-demand network access to a shared pool of configurable computing resources. Hadoop handles massive data through parallel processing; developed by the Apache Foundation, it is the most widely used distributed platform. This paper focuses on the Hadoop platform computing model and the Map/Reduce algorithm, and combines the K-means data mining algorithm with the platform to analyze the effectiveness and application of cloud computing.

Keywords: cloud computing; Hadoop; Map/Reduce; data mining; K-means

I. INTRODUCTION [4]

As computer network technology has matured, cloud computing has become widely recognized and applied. IT giants such as Google, IBM, Amazon and Microsoft have launched their own commercial products and made cloud computing a priority in their future development strategies. This brings a large-scale data problem: online information is growing explosively, and each user may generate a huge amount of it. Meanwhile, transistor circuits are gradually approaching their physical limits, and Moore's law, under which CPU performance doubled every 18 months, is reaching its failure point. Faced with massive information, how to manage and store data is the key issue we must address.

Hadoop is a platform for building cloud computing, designed as an Apache open source project. We use this framework to solve these problems and manage data conveniently. It rests on two major technologies: HDFS, which provides storage and fault tolerance for huge files, and Map/Reduce, which processes the data through distributed computing.

II. BACKGROUND KNOWLEDGE AND RELEVANT CONCEPTS

Cloud computing developed from a variety of network technologies, including parallel computing, distributed computing and grid computing. It carries out all tasks virtually in the cloud, combining many cheap computing nodes into a huge system that provides large computing capacity. In other words, it can be regarded as separating the monitor from the main engine, as Figure 1 shows.

Figure 1: Cloud Computing Structure

The model resembles a banking system: we can store data and use applications as conveniently as we save and manage money in a bank. The user no longer needs extensive hardware as background support; the only requirement is a connection to the cloud.

A Parallel Computing

Parallel computing is a method of raising computing efficiency by solving a problem with multiple resources. The main principle is to split a task into N parts and send them to N computers, so that efficiency increases up to N times. Parallel computing has a serious shortcoming, however: the parts are interdependent, and this is a barrier to its development.

B Distributed Computing

The basic principle of distributed computing is consistent with that of parallel computing. Its advantages are strong fault tolerance and the ability to expand computing capacity easily by increasing the number of computing nodes. The difference is that the parts are independent of each other, so the failure of a batch of computing nodes does not affect the accuracy of the calculation.

III. THE STRUCTURE OF CLOUD COMPUTING

A cloud computing platform is a powerful cloud network that connects a large number of concurrent services and can be extended with virtual servers. The platform combines these resources to support huge computing and storage capacity. The general structure of a cloud computing system is shown in Figure 2.

Figure 2: Cloud Computing Platform

A Hadoop Structure [3]

Hadoop, developed under the Apache project, is a basic structure for distributed systems. It lets users program distributed software easily even if they know nothing about the underlying environment. HDFS, the base layer, is the main storage system in Hadoop and runs on ordinary cluster components. It is usually deployed on low-cost hardware to provide high-throughput access to application data, which suits programs with large datasets.

B Map/Reduce [1]

Map/Reduce, presented by Jeffrey Dean and Sanjay Ghemawat, is a programming model for massive data computing; it was developed by Google and is a core technology of cloud computing. The model abstracts the common operations on large datasets into Map and Reduce steps, reducing the difficulty of distributed and parallel programming for developers.

Figure 3: Map/Reduce Process

IV. APPLICATION OF CLUSTERING

Clustering algorithms are an important part of data mining, especially in such large data-computing systems, and clustering is particularly vital in cloud technology. Partitioning data by its characteristics is the most vital step in the storage and security of cloud computing. Among the many clustering algorithms, we combine K-means with Map/Reduce to discuss the distribution of data on the Hadoop platform.

K-means Algorithm [3]

The K-means algorithm optimizes an objective function: the distance between each data point and its cluster center, measured by Euclidean distance. The resulting clustering satisfies the usual rules: objects in the same cluster are highly similar, while objects in different clusters have low similarity.

V. CONCLUSION

After studying cloud computing based on the Hadoop platform, I conclude that data storage is an important element of cloud computing. This paper discussed the core technologies of the Hadoop framework, HDFS and Map/Reduce. Combining data mining with the K-means clustering algorithm makes data management easier and quicker in the cloud computing model. Even

though this technology is still in its infancy, we believe that with continuous improvement, cloud computing will develop in a secure and reliable direction.

REFERENCES

[1] Wang Xiangqian. Optimization of a High Performance MapReduce System. Computer Software Theory, University of Science and Technology of China, 2010.
[2] Zhu Zhu. Research and Application of a Massive Data Processing Model Based on Hadoop. Beijing University of Posts and Telecommunications, 2008.
[3] Yang Chenzhu. The Research of Data Mining Based on Hadoop. Chongqing University, 2010.
Qiu Rongtai. Research on MapReduce Application Based on Hadoop. Henan Polytechnic University, 2009.
[4] Wang Peng. Into Cloud Computing. People's Posts and Telecommunications Press, 2009 (in Chinese).
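Appendix: the combination of K-means and Map/Reduce described in Section IV can be sketched in plain Python. This is a minimal single-machine illustration of the idea, not the paper's actual implementation: the map step assigns each point to its nearest center by Euclidean distance, and the reduce step recomputes each center as the mean of its assigned points. All function and variable names here are illustrative.

```python
import math
from collections import defaultdict

def nearest_center(point, centers):
    # Map step: emit the index of the center closest to this point
    # (Euclidean distance, as in Section IV).
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))

def kmeans_mapreduce(points, centers, iterations=10):
    for _ in range(iterations):
        # Map phase: group points by their nearest center,
        # i.e. emit (center_index, point) pairs.
        groups = defaultdict(list)
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        # Reduce phase: recompute each center as the mean of its group.
        # A center with no assigned points is left unchanged.
        new_centers = list(centers)
        for idx, pts in groups.items():
            new_centers[idx] = tuple(sum(x) / len(pts) for x in zip(*pts))
        centers = new_centers
    return centers

centers = kmeans_mapreduce(
    points=[(0, 0), (0, 1), (10, 10), (10, 11)],
    centers=[(0, 0), (10, 10)],
)
```

In a real Hadoop job the map phase would run on the data nodes holding each HDFS block and the reduce phase would aggregate the partial sums per center, but the division of work into these two steps is the same.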