Knowledge Management in Cloud Using Hadoop

Vishwas Churihar
Dept. of Computer Science Engineering, Laxmi Narain College of Technology, Bhopal, India (M.P.)
Vishwas.churihar@yahoo.com

Shweta Shrivastava
Dept. of Computer Science Engineering, Laxmi Narain College of Technology, Bhopal, India (M.P.)
Shwetashri.26@gmail.com

Abstract: Cloud computing is a colloquial expression used to describe a variety of computing concepts that involve a large number of computers connected through a real-time communication network (typically the Internet). The characteristics of cloud computing defined by NIST (the National Institute of Standards and Technology) are (i) on-demand self-service, (ii) broad network access, (iii) location-independent resource pooling, (iv) rapid elasticity, and (v) measured service. The services provided by the cloud are (i) SaaS (Software-as-a-Service), (ii) PaaS (Platform-as-a-Service), and (iii) IaaS (Infrastructure-as-a-Service). An essential piece of software for creating an effective and efficient cloud is Hadoop; running Hadoop means running a set of daemons, or resident programs, on the different servers in your network.

Key Words: Cloud computing, characteristics of cloud, services of cloud, implementing a cloud using Hadoop.

I. INTRODUCTION

Cloud computing is a colloquial expression used to describe a variety of computing concepts that involve a large number of computers connected through a real-time communication network (typically the Internet). In science, cloud computing is a synonym for distributed computing over a network, and means the ability to run a program on many connected computers at the same time. The popularity of the term can be attributed to its use in marketing to sell hosted services, in the sense of application service provisioning, that run client-server software at a remote location [1][3][4].

Fig. 1: Architecture of cloud.
A. History

a) 1950: Scientist Herb Grosch postulated that the entire world would operate on dumb terminals powered by about 15 large data centers.
b) 1960: John McCarthy opined that computation may someday be organized as a public utility, and Douglas Parkhill's book The Challenge of the Computer Utility explored almost all the modern-day characteristics of cloud computing.
c) 1969: ARPANET was developed and later evolved into the Internet.
d) 1990: The Internet age began.
e) 1993: Web browsers such as Netscape and Internet Explorer were developed.
f) 1996: The first SaaS offering, Salesforce.com, was launched.
g) 2002: Amazon launched Amazon Web Services (AWS), built on utility computing; AWS is a collection of remote computing services that together make up a cloud computing platform.
h) 2008: Eucalyptus became the first open-source platform compatible with the Amazon Web Services (AWS) API for deploying private clouds.
On March 1, 2011, IBM announced the IBM SmartCloud framework to support Smarter Planet. Among the various components of the Smarter Computing foundation, cloud computing is a critical piece [1].

II. CHARACTERISTICS

According to NIST, cloud computing exhibits the following characteristics:
a) On-demand self-service: A client can provision computing resources, such as server time and network storage, as needed automatically, without requiring human interaction with each service provider.
b) Broad network access: Resources in the cloud are accessible over the network using standard methods, in a manner that provides platform-independent access to clients of all types.
c) Location-independent resource pooling: The provider's computing resources are pooled to serve multiple customers using a multi-tenant model, with different
physical and virtual resources dynamically assigned and reassigned according to demand.
d) Rapid elasticity: The system can add resources by scaling, either automatically or manually; cloud computing resources should look limitless to the client and be available for purchase at any time and in any quantity.
e) Measured service: A client can be charged based on a metric such as the amount of storage used, the number of transactions, network input/output or bandwidth, or the amount of processing power used; that is, a client is charged based on the level of service provided [3][4].

III. SERVICES PROVIDED BY A CLOUD

Cloud computing providers offer their services according to several fundamental models: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS), where IaaS is the most basic and each higher model abstracts from the details of the models below it. Other variants of anything-as-a-service (XaaS), such as Strategy-as-a-Service, Collaboration-as-a-Service, Business-Process-as-a-Service, and Database-as-a-Service, are described in a comprehensive taxonomy model published in 2009. In 2012, Network-as-a-Service (NaaS) and Communication-as-a-Service (CaaS) were officially included by the ITU (International Telecommunication Union) among the basic cloud computing models, as recognized service categories of a telecommunication-centric cloud ecosystem [1][3].

A. Infrastructure-as-a-Service (IaaS)

In the most basic cloud-service model, providers of IaaS offer computers, physical or (more often) virtual machines, and other resources. (A hypervisor, such as Xen or KVM, runs the virtual machines as guests. Pools of hypervisors within the cloud operational support system can host large numbers of virtual machines and scale services up and down according to customers' varying requirements.)
IaaS clouds often offer additional resources such as a virtual-machine disk-image library, raw (block) and file-based storage, firewalls, load balancers, IP addresses, virtual local area networks (VLANs), and software bundles. IaaS-cloud providers supply these resources on demand from their large pools installed in data centers. For wide-area connectivity, customers can use either the Internet or carrier clouds (dedicated virtual private networks). Examples of IaaS providers include Amazon EC2, Google Compute Engine, HP Cloud, Joyent, Linode, NaviSite, Rackspace, and ReadySpace Cloud Services [3].

B. Platform-as-a-Service (PaaS)

In the PaaS model, cloud providers deliver a computing platform, typically including an operating system, a programming-language execution environment, a database, and a web server. Application developers can develop and run their software solutions on a cloud platform without the cost and complexity of buying and managing the underlying hardware and software layers. With some PaaS offerings, the underlying compute and storage resources scale automatically to match application demand, so that the cloud user does not have to allocate resources manually. Examples of PaaS include AWS Elastic Beanstalk, Cloud Foundry, Heroku, Force.com, Engine Yard, Mendix, OpenShift, Google App Engine, AppScale, Windows Azure Cloud Services, OrangeScape, and Jelastic [1][3].

C. Software-as-a-Service (SaaS)

In the business model using Software-as-a-Service (SaaS), users are provided access to application software and databases. Cloud providers manage the infrastructure and platforms that run the applications. SaaS is sometimes referred to as "on-demand software" and is usually priced on a pay-per-use basis; SaaS providers generally price applications using a subscription fee. In the SaaS model, cloud providers install and operate application software in the cloud, and cloud users access the software from cloud clients.
Cloud users do not manage the cloud infrastructure and platform where the application runs. This eliminates the need to install and run the application on the cloud user's own computers, which simplifies maintenance and support. Cloud applications differ from other applications in their scalability, which can be achieved by cloning tasks onto multiple virtual machines at run time to meet changing work demand. Load balancers distribute the work over the set of virtual machines. This process is transparent to the cloud user, who sees only a single access point. To accommodate a large number of cloud users, cloud applications can be multi-tenant; that is, any machine may serve more than one cloud-user organization. It is common to refer to special types of cloud-based application software with a similar naming convention: Desktop-as-a-Service, Business-Process-as-a-Service, Test-Environment-as-a-Service, Communication-as-a-Service. The pricing model for SaaS applications is typically a monthly or yearly flat fee per user, so the price scales and adjusts if users are added or removed at any point. Examples of SaaS include Google Apps, Microsoft Office 365, Petrosoft, Onlive, GT Nexus, Marketo, Casengo, TradeCard, Salesforce, and CallidusCloud [3].

D. Network-as-a-Service (NaaS)

NaaS is a category of cloud services in which the capability provided to the cloud-service user is network/transport connectivity services and/or inter-cloud network connectivity services. NaaS involves optimizing resource allocations by considering network and computing resources as a unified whole. Traditional NaaS services include flexible and extended VPNs and bandwidth on demand. The materialization of the NaaS concept also includes the provision of a virtual network service by the owners of the network infrastructure to a third party (VNP/VNO) [9][10].
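The scaling and load-balancing behaviour described for SaaS above, cloned application instances behind a single access point, can be sketched minimally in Python. The round-robin policy and the VM names are illustrative assumptions; real cloud load balancers use more sophisticated strategies.

```python
# Sketch: a single access point spreading requests round-robin over the set
# of cloned virtual-machine instances, transparently to the cloud user.
from itertools import cycle

class LoadBalancer:
    def __init__(self, vms):
        self._vms = cycle(vms)   # rotate endlessly through the cloned instances

    def route(self, request):
        """Assign the next request to the next instance in the rotation."""
        vm = next(self._vms)
        return f"{request} -> {vm}"

lb = LoadBalancer(["vm-1", "vm-2", "vm-3"])
for req in ["req-a", "req-b", "req-c", "req-d"]:
    print(lb.route(req))
```

Because the balancer, not the client, picks the instance, new clones can be added to the pool at run time without the cloud user ever seeing more than the single access point.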
Fig. 2: Services provided by cloud.

IV. SOFTWARE USED

A. Hadoop

As Grace Hopper put it: "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." On a fully configured cluster, running Hadoop means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include [2][6]:
a) Name Node
b) Data Node
c) Secondary Name Node
d) Job Tracker
e) Task Tracker

a) Name Node: The Name Node is the master of HDFS; it directs the slave Data Node daemons to perform the low-level I/O tasks. The Name Node is the bookkeeper of HDFS: it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system. There is only one Name Node process running on any Hadoop cluster; it runs in its own JVM process, and in a typical production cluster it runs on a separate machine. When the Name Node goes down, the file system goes offline. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add, copy, move, or delete a file. The Name Node responds to successful requests by returning a list of relevant Data Node servers where the data lives [6][7].
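The Name Node's bookkeeping role described above can be illustrated with a toy in-memory model. This is a sketch, not the real HDFS API: the file paths, block ids, and Data Node names are illustrative, mirroring the two example files of Fig. 3.

```python
# Toy model of the Name Node's bookkeeping: which blocks make up each file,
# and which Data Nodes hold each block. A client lookup returns the list of
# Data Node servers where the data lives.

class NameNode:
    def __init__(self):
        self.file_to_blocks = {}   # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> Data Nodes holding a replica

    def add_file(self, path, blocks):
        """Record a file as a sequence of (block id, replica locations)."""
        self.file_to_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, datanodes in blocks:
            self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Answer a client: for each block of the file, where to read it."""
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[path]]

# data1 occupies blocks 1-3, data2 occupies blocks 4-5, as in Fig. 3.
nn = NameNode()
nn.add_file("/user/chuck/data1", [(1, ["dn-1", "dn-2"]),
                                  (2, ["dn-2", "dn-3"]),
                                  (3, ["dn-1", "dn-3"])])
nn.add_file("/user/james/data2", [(4, ["dn-1", "dn-3"]),
                                  (5, ["dn-2", "dn-3"])])
print(nn.locate("/user/james/data2"))
```

Note that the Name Node holds only metadata; the block contents themselves live on the Data Nodes, which is why losing the Name Node takes the whole file system offline.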
b) Data Node: Each slave machine in your cluster hosts a Data Node daemon to perform the grunt work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system. When you want to read or write an HDFS file, the file is broken into blocks and the Name Node tells your client which Data Node each block resides in. Your client then communicates directly with the Data Node daemons to process the local files corresponding to the blocks. Furthermore, a Data Node may communicate with other Data Nodes to replicate its data blocks for redundancy. There is only one Data Node process running on any Hadoop slave node; it runs in its own JVM process. On start-up, a Data Node connects to the Name Node. Data Node instances can talk to each other, mostly while replicating data.

Fig. 3: Interaction of Name Node/Data Node in HDFS.

In Figure 3, we show two data files, one at /user/chuck/data1 and another at /user/james/data2. The data1 file takes up three blocks, which we denote 1, 2, and 3, and the data2 file consists of blocks 4 and 5. The content of the files is distributed among the Data Nodes.

c) Secondary Name Node: The Secondary Name Node (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. Like the Name Node, each cluster has one SNN, and it typically resides on its own machine as well; no other Data Node or Task Tracker daemons run on the same server. The SNN differs from the Name Node in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the Name Node to take snapshots of the HDFS metadata at intervals defined by the cluster configuration [6].

d) Job Tracker: The Job Tracker is the daemon service for submitting and tracking Map Reduce jobs in Hadoop. There is only one Job Tracker process running on any Hadoop cluster; it runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the Job Tracker node's location. The Job Tracker is a single point of failure for the Hadoop Map Reduce service: if it goes down, all running jobs are halted. The Job Tracker in Hadoop performs the following actions. The Job Tracker talks to the Name Node to determine the location of the data.
The Job Tracker locates Task Tracker nodes with available slots at or near the data.
The Job Tracker submits the work to the chosen Task Tracker nodes. The Task Tracker nodes are monitored; if they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different Task Tracker. A Task Tracker notifies the Job Tracker when a task fails. The Job Tracker then decides what to do: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the Task Tracker as unreliable [7].

e) Task Tracker: A Task Tracker is a slave-node daemon in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a Job Tracker. There is only one Task Tracker process running on any Hadoop slave node; it runs in its own JVM process. Every Task Tracker is configured with a set of slots, which indicate the number of tasks it can accept. The Task Tracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the Task Tracker. The Task Tracker monitors these task instances, capturing the output and exit codes. When a task instance finishes, successfully or not, the Task Tracker notifies the Job Tracker. The Task Trackers also send out heartbeat messages to the Job Tracker, usually every few minutes, to reassure the Job Tracker that they are still alive. These messages also inform the Job Tracker of the number of available slots, so the Job Tracker can stay up to date with where in the cluster work can be delegated. This topology features a master node running the Name Node and Job Tracker daemons and a standalone node with the SNN in case the master node fails. For small clusters, the SNN can reside on one of the slave nodes; for large clusters, separate the Name Node and Job Tracker onto two machines.
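The heartbeat protocol described above can be sketched as follows. This is an illustrative model only, not Hadoop's implementation: the timeout value, tracker names, and method names are assumptions.

```python
# Sketch of Job Tracker heartbeat monitoring: each Task Tracker periodically
# reports in with its free slots; a tracker that stays silent past the
# timeout is deemed failed, so its work can be rescheduled elsewhere.

HEARTBEAT_TIMEOUT = 30  # seconds of silence before a tracker is deemed failed

class JobTracker:
    def __init__(self):
        self.last_seen = {}    # tracker name -> time of last heartbeat
        self.free_slots = {}   # tracker name -> slots available for new tasks

    def heartbeat(self, tracker, now, free_slots):
        """Record a heartbeat: the tracker is alive and reports its slots."""
        self.last_seen[tracker] = now
        self.free_slots[tracker] = free_slots

    def failed_trackers(self, now):
        """Trackers whose last heartbeat is older than the timeout."""
        return [t for t, seen in self.last_seen.items()
                if now - seen > HEARTBEAT_TIMEOUT]

jt = JobTracker()
jt.heartbeat("tracker-1", now=0, free_slots=2)
jt.heartbeat("tracker-2", now=0, free_slots=4)
jt.heartbeat("tracker-2", now=40, free_slots=3)   # tracker-1 stays silent
print(jt.failed_trackers(now=40))
```

The same heartbeat doubles as the slot report, which is how the Job Tracker keeps an up-to-date picture of where in the cluster new work can be delegated.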
The slave machines each host a Data Node and a Task Tracker, for running tasks on the same node where their data is stored [2][6].

Fig. 4: Interaction of Job Tracker/Task Tracker in HDFS.

Fig. 5: Topology of a Hadoop cluster.

B. Components of Hadoop

HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS is designed to support very large files; applications that are compatible with HDFS are those that deal with large data sets. HDFS supports write-once-read-many semantics on files. NFS (Network File System), by contrast, is a kind of file system where data resides on one centralized machine and all cluster members read and write data from that shared store, which is not as efficient as HDFS.

C. Differences between HDFS and NFS

In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NFS data is stored on dedicated hardware. HDFS is designed to work with the Map Reduce system, since computation is moved to the data; NFS is not suitable for Map Reduce, since data is stored separately from the computations. HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NFS
is provided by a single machine and therefore does not provide data redundancy.

D. Map Reduce

Map Reduce is an algorithm, or concept, for processing huge amounts of data in a faster way. As its name suggests, the work is divided into a Map step and a Reduce step. A Map Reduce job usually splits the input data set into independent chunks (big data sets into multiple small data sets).

V. KNOWLEDGE MANAGEMENT CLOUD

Cloud computing is becoming popular today, and IT giants such as Google, Amazon, Microsoft, and IBM have started their own cloud computing infrastructures. This paper gives an overview survey of current cloud computing architecture, services, and cloud-implementing software such as Hadoop. The Knowledge Management Cloud provides services as IaaS and SaaS. The Knowledge Management Cloud stores the data of an organization whose employees do marketing and need information about the company's products, and about competitor companies' products, at any time. By storing data on this cloud, the employees are able to get the information anytime.

A. Purpose of the cloud

The Knowledge Management Cloud is the cloud that stores the data of an organization, helping the employees get information about a product, that is, its name, description, and reviews, and also information about competitor companies' products. The cloud provides services like IaaS and SaaS, and by storing data on this cloud the employees are able to get the information anytime.

B. Implementation of the cloud

To implement the Knowledge Management Cloud we need the Linux operating system, on which we configure Hadoop, i.e., the middleware software, and the Eclipse software, which provides an advanced Java platform.
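The Map Reduce flow described in Section IV-D, splitting the input into independent chunks, mapping each chunk, grouping intermediate pairs by key, and reducing each group, can be illustrated with a minimal in-memory word-count sketch. This is pure Python with no Hadoop dependency; the chunking and function names are illustrative.

```python
# Minimal word-count sketch of the Map Reduce model: map each independent
# chunk to (key, value) pairs, shuffle (group by key), then reduce each group.
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def map_reduce(chunks):
    intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(intermediate))

# Two "splits" processed independently, as a Hadoop job would do per block.
print(map_reduce(["cloud hadoop cloud", "hadoop hdfs"]))
```

Because each chunk is mapped independently, the map phase parallelizes trivially across the cluster; only the shuffle requires moving data between nodes, which is why HDFS's move-computation-to-data design suits Map Reduce.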
The Knowledge Management Cloud is divided into modules, named as follows:
a) Subscription Master
b) Organization Registration
c) Department Information
d) Employee Information
e) Product Information
f) Competitor Company Information
To use the facilities of this cloud, the organization has to register and then select a subscription plan; each employee then gets an ID and can access the data stored on the cloud.

VI. CONCLUSION

Cloud computing is becoming popular today, and IT giants such as Google, Amazon, Microsoft, and IBM have started their own cloud computing infrastructures. This paper has given an overview survey of current cloud computing architecture, services, and cloud-implementing software such as Hadoop. The Knowledge Management Cloud provides IaaS and SaaS, and in the future we will develop this cloud for a company.

REFERENCES

[1] Sun Microsystems, "Introduction to Cloud Computing Architecture," 2009.
[2] Hadoop MapReduce, hadoop.apache.org/mapreduce.
[3] "Autonomic Cloud Computing: Open Challenges and Architectural Elements," arXiv:1209.3356, IEEE.
[4] P. Mell and T. Grance, "The NIST Definition of Cloud Computing," Version 15, 10-7-09, http://www.wheresmyserver.co.nz/storage/media/faq-files/cloud-def-v15.
[5] U.S. General Services Administration, "Federal Cloud Computing Initiative Overview," www.scribd.com/doc/18031511/us-federal-Cloud-Computing-Initiative-Overview-Presentation-GSA.
[6] Hadoop Distributed File System, hadoop.apache.org/hdfs.
[7] Hadoop Open Source Project, http://hadoop.apache.org/core/, 2009.
[8] A. Bechtolsheim, "Cloud Computing and Cloud Networking," talk at UC Berkeley, 2008.
[9] J. Ekanayake and S. Pallickara, "MapReduce for Data Intensive Scientific Analysis," Fourth IEEE International Conference on eScience, 2009.
[10] IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 1, January 2013.