HADOOP INTO CLOUD: A RESEARCH

Prof. M. S. Malkar, H.O.D., PCP, Pune, msmalkar@rediffmail.com
Ms. Misha Ann Alexander, M.E. Student and Lecturer, PCP, Pune, annalexander33@gmail.com

Abstract- Hadoop and the cloud have gained a lot of popularity in the IT industry. Hadoop, an open-source framework that grew out of ideas published by Google, can handle large amounts of structured or unstructured data, known as Big Data. Hadoop spreads this data across the nodes of a cluster, which makes it possible to handle very large volumes: the processing is done on the individual nodes and the results are combined into a single output. Cloud computing can be defined as scalable, heterogeneous resources provided to us as services over the Internet, with the needed bandwidth, security, computing capacity and reliability. As part of green computing, the IT industry is trying to reduce energy consumption and the cost invested in new infrastructure, software licenses and so on. Cloud computing has therefore gained a lot of popularity, since industries use only the services they require and pay per use. This paper presents a study of how cloud servers can be efficiently utilized as Hadoop clusters for storing and processing large amounts of data.

I. INTRODUCTION
We are in the decade of digital data known as Big Data. Data grows with time and is at the heart of every business, scientific study and search engine, and is therefore a very important asset. Nowadays we interact with terabytes, petabytes or even zettabytes of data, and a traditional RDBMS may not be able to deliver the desired output at this scale. This problem can largely be solved by Hadoop, which makes efficient use of resources by dividing the data, storing it on separate nodes and processing it in parallel [3]. Big Data is too large to operate on and manage with conventional means; to manage such huge data, large numbers of servers are required [2]. Cloud computing provides large infrastructures for storing data and hosting services. This paper illustrates what Hadoop and cloud computing are and how these two existing technologies can be combined to increase the efficiency of data processing and resource utilization. The remainder of the paper is structured as follows: Section II describes Hadoop, Section III describes cloud computing, Section IV describes deploying Hadoop into the cloud, and Section V concludes the paper.
II. HADOOP

A. Why Hadoop Evolved
Due to the growing size of data, it has become very difficult to process data using a traditional RDBMS. Relational database management systems evolved around 1970 and are still in use, but at Big Data scale they show the following symptoms:
1. Counting distinct records becomes very slow when the data is very large.
2. An RDBMS also suffers from the "alter table of doom": if a table is large and has many columns, altering it takes a very long time, and even longer if a NOT NULL constraint is involved.
3. An RDBMS cannot efficiently step row by row through cursors when the data is very large.
4. Data merge and mash-up problems arise because many businesses combine online and offline data whose patterns and structures differ; if we want an analytical model over both, an RDBMS cannot readily combine structured and unstructured data.
These problems can be addressed by Hadoop [3].

B. Hadoop
Hadoop is essentially a collection of complex software components with more than 200 tunable parameters. The Hadoop framework can run applications on very large clusters of computers. Hadoop is a distributed system that stores and analyzes large amounts of data spread across the nodes of a cluster in order to extract meaningful information. It is highly scalable and fault tolerant [4].

C. Hadoop Cluster
A cluster is a group of independent computer systems, referred to as nodes, working together as a unified computing resource.

D. Hadoop Data Model
Hadoop does not impose a particular data format; it works with a variety of file formats, from plain text files to database files, stored across the cluster. To increase Hadoop's processing power, you simply add more servers to the cluster. Hadoop is often faster than other data processing systems because, with an RDBMS, data from multiple sites must first be converted into a rigid tabular format before analysis can be performed [6]. The software divides the data into pieces, stores them across the cluster and keeps track of where each piece resides, since multiple replicated copies are kept. Hadoop runs on a number of machines that share neither memory nor disks, and is therefore cheaper than distributed database systems.

Fig. 1: Hadoop data model

E. Hadoop Components
i. MapReduce Model
Data is processed in two phases by the MapReduce framework. HDFS splits the data set into independent chunks, which are processed in parallel by the map tasks. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of a job are stored in the file system. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. When a job is submitted, the JobTracker schedules and manages it; a TaskTracker is the manager for all tasks on a given node, and each task executes an individual map or reduce [5].
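To make the two phases concrete, a minimal word-count sketch written against the org.apache.hadoop.mapreduce API is given below. The class names, field names and file names are ours and purely illustrative (each class would normally sit in its own source file); only the Hadoop types and the map/reduce method signatures come from the framework.

// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: each input split is processed in parallel; the mapper emits a
// (word, 1) pair for every token it finds in its portion of the input.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// WordCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: the framework has already sorted and grouped the map output
// by key, so the reducer sees every count emitted for one word and sums them.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}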
ii. Hadoop Distributed File System (HDFS)
A file system controls the storage and retrieval of data: it organizes the data in the storage area, names the stored information, and keeps track of the beginning and end of every record. The Hadoop file system differs from other file systems in that it is designed to be deployed on low-cost hardware. HDFS holds gigabytes to terabytes of data and is fault tolerant; it consists of several servers that together store the file system [2].

Fig. 2: HDFS architecture

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing and renaming files and directories, and it also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients; they also perform block creation, deletion and replication upon instruction from the NameNode [7].
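As a sketch of how a client interacts with this architecture, the short Java program below writes a file into HDFS and reads it back through the org.apache.hadoop.fs.FileSystem API. The NameNode address hdfs://namenode-host:9000, the class name and the path are assumed placeholder values, not part of any particular deployment.

// HdfsClientSketch.java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws IOException {
        // The URI names the NameNode; the NameNode only resolves requests to
        // block locations, while the bytes themselves flow to and from DataNodes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: the file is cut into blocks and each block is replicated on a
        // set of DataNodes chosen by the NameNode.
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("hello hdfs");
        out.close();

        // Read: the client asks the NameNode where the blocks live and then
        // streams them back from the DataNodes.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}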
III. CLOUD COMPUTING
Cloud computing is a new trend in which we access both hardware and software over the Internet, from outside our own environment, and pay for the service according to use. The platform hides its complexity from the user; for example, with www.gmail.com our data is stored in the cloud and we can access it anywhere, at any time. The cloud lets us avail of services without investing in expensive infrastructure.
A public cloud is operated by a service (cloud) provider over the Internet. A private cloud is owned by a particular organization or a third party. A community cloud is operated by a number of organizations, with its services available only to a certain group; these services can be owned by a private organization or by a cloud service provider. A hybrid cloud is a combination of private and community clouds.

A. Services Provided by the Cloud
In cloud computing, a service is reused by several users across the network. The cloud provides scalability and multi-tenancy, which allow users to access the system from different hardware devices (e.g. Gmail, CRM); users access these services by paying per use.
i. Software as a Service (SaaS): A complete application is offered as a service on demand. A single instance of the software runs in the cloud and serves multiple end users or client organizations (e.g. Gmail, CRM); users pay for the service per use.
ii. Platform as a Service (PaaS): Offers a development platform in the cloud. It encapsulates a layer of software and provides it as a service that can be used to build higher-level services, producing a platform by integrating an operating system, middleware, application software and even a development environment.
iii. Infrastructure as a Service (IaaS): Delivers basic storage and computing capabilities as standardized services over the network. It involves offering hardware-related services using the principles of cloud computing, such as storage services (database or disk storage) or virtual servers.
iv. Cloud Application Program Interface: The cloud infrastructure is programmable [18].

IV. HADOOP INTO CLOUD
Many cloud providers, such as Amazon, Microsoft and Google, provide enterprises with a cloud, which is essentially an array of servers where huge amounts of data can be stored. The main intention of these enterprises is data analysis, for which the data in the cloud has to be fed into a data processing system. Hadoop is a data processing system that needs clusters of servers to process Big Data, and such clusters can easily be provided by the cloud. We can therefore try to combine the two existing technologies, and this is motivating many cloud providers to deploy Hadoop into the cloud. Hadoop is written in Java and provides the application programmer with Map and Reduce functions through its API; the MapReduce capability is available in several languages, such as Java and Python [8].

A. Challenges of Deploying Hadoop into the Cloud
1. Hadoop was originally architected for the world of "big iron", whereas cloud infrastructure depends on virtualization to manage and present an aggregation of infrastructure components.
2. Hadoop data nodes (servers) are installed in racks; a rack contains multiple data servers with a top-of-rack switch used for data communication. Rack awareness is needed because Hadoop writes each block to three data nodes by default, placed on different physical racks; replicating the data across multiple racks prevents data loss due to a data node or rack failure (the replication factor itself is just configuration, as sketched below).
3. The performance of a Hadoop cluster may be better on dedicated hardware, but the agility of running it in the cloud on demand may outweigh such limitations for some workloads.
Considering all the above challenges, we have to find suitable ways to deploy Hadoop on the cloud.
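As a small illustration of the replication point in challenge 2, the hedged Java sketch below adjusts the replication factor through standard HDFS client calls. The property name dfs.replication is standard, but the class name, NameNode address, path and factor values are only assumed example values.

// ReplicationSketch.java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Files created through this configuration will ask for 2 replicas
        // instead of the default 3 (e.g. on a small virtualized test cluster).
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        // The factor of an existing file can also be changed; the NameNode then
        // schedules the additional copies (or removals) on the DataNodes.
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);

        fs.close();
    }
}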
B. Deploying Hadoop on the Cloud
The Rackspace private cloud can be used to implement Hadoop on the cloud quickly using open source software. OpenStack is an open cloud standard that helps to build both private and public clouds; a private cloud is reserved for your data alone, while a public cloud holds data from many tenants and therefore needs more security. Apache Hadoop processes large amounts of structured and unstructured data, and Apache projects such as Hive, Pig and HBase provide tools to manipulate that data. Hadoop was originally architected for a static, predictable infrastructure, but virtualization is nevertheless beneficial [11].

C. Cloud Creation and Hadoop Installation
Rackspace Private Cloud Software (RPCS) is free and open source software that can be used to launch a cloud powered by OpenStack. RPCS provides the same cloud platform that powers Rackspace's public cloud, the largest open cloud deployment in the world [11].

Steps to install Hadoop:
1. You should have 3-4 systems on a network, each running Ubuntu 9.04 or later with the sun-java6-jdk and ssh packages installed. It is preferable, though not mandatory, to start from a fresh installation.
2. Create a new user for the Hadoop work. This step is optional but recommended, so that the HADOOP_HOME path is the same across the cluster.
3. Extract the Hadoop distribution into your home folder.
4. Repeat these steps on all the nodes, making sure that HADOOP_HOME is the same on every node.
5. Note the IPs of the three nodes; let them be 192.168.1.5, 192.168.1.6 and 192.168.1.7, where *.1.5 is the NameNode and *.1.6 is the JobTracker (these two are the main, exclusive servers), and *.1.7 is the DataNode, used both for task tracking and for storing data.
6. In the conf directory of the Hadoop distribution you will find a file called hadoop-site.xml. Copy and paste the following contents between <configuration> and </configuration>:

<property>
  <name>fs.default.name</name>
  <!-- IP of the NameNode -->
  <value>hdfs://192.168.1.5:9090</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <!-- IP of the JobTracker -->
  <value>192.168.1.6:9050</value>
</property>

7. Make sure the same is done on all the nodes in the system.
8. Now create the slaves list so the NameNode knows where to replicate data. Go to the HADOOP_HOME directory on the NameNode; under the conf folder you should see a file called slaves.
9. On opening slaves you should see a line containing "localhost". Add the IPs of all the DataNodes you wish to connect to the cluster, one per line, and then start the cluster.
10. Open a terminal on the NameNode and go to HADOOP_HOME.
11. Execute the following commands:
# Format the HDFS on the NameNode
$ bin/hadoop namenode -format
# Start the Distributed File System service on the NameNode; it will ask for
# the passwords for itself and all the slaves in order to connect via SSH
$ bin/start-dfs.sh
12. Your NameNode should now be up and running.
13. Next is the JobTracker node. Execute the following command:
# Start the Map/Reduce service on the JobTracker
$ bin/start-mapred.sh
14. The same process follows for the JobTracker: it asks for the passwords for itself and all its slave nodes.
15. Now that the cluster is started, it is time to check it out. On the NameNode execute the following command:
# Copy a folder (conf) into HDFS - for sample purposes
$ bin/hadoop fs -put conf input
16. If you go to http://192.168.1.5:50070 in your browser, you should see the Hadoop HDFS admin interface, a simple interface that shows the cluster summary, live and dead nodes, and so on.
17. You can browse the HDFS using the link in the top-left corner.
18. Go to http://192.168.1.6:50030 to view the Hadoop Map/Reduce admin interface; it displays the currently running jobs, finished jobs, and so on.
19. Finally, check the Map/Reduce process by executing some code [12]; a sketch of a simple driver that submits a job to this cluster is given below.
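As one possible way to carry out step 19, the hedged driver sketch below submits the word-count job from Section II to the cluster configured above. It assumes the WordCountMapper and WordCountReducer classes sketched earlier are on the classpath, and it reuses the NameNode and JobTracker addresses placed in hadoop-site.xml in step 6; the class name and paths are illustrative only.

// WordCountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster built above; these are the same
        // values placed in hadoop-site.xml in step 6.
        conf.set("fs.default.name", "hdfs://192.168.1.5:9090");
        conf.set("mapred.job.tracker", "192.168.1.6:9050");

        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // sketched in Section II
        job.setReducerClass(WordCountReducer.class);    // sketched in Section II
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // "input" is the conf folder copied into HDFS in step 15;
        // "output" must not exist before the job runs.
        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));

        // Submit and wait; progress also appears on the JobTracker web
        // interface (http://192.168.1.6:50030) mentioned in step 18.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}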
V. CONCLUSION
With the increasing adoption of the cloud, where data is already stored on cloud servers, analyzing that data in place is cost efficient; deploying Hadoop on the cloud can therefore reduce the cost of moving Big Data from one location to another. In this era of green computing, where every enterprise is cutting costs, deploying Hadoop on the cloud is a good way to do so. Many cloud providers have started working on this in order to offer customers a Hadoop-in-the-cloud solution. Several solutions for running Hadoop on the cloud already exist, and more are expected to follow.

VI. REFERENCES
[1] James Turner, Hadoop Architecture and its Data Applications, 12 Jan. 2011.
[2] Nati Shalom's blog, 28 Aug. 2012.
[3] Rakesh Rao, When to Use Hadoop Instead of a Relational Database Management System (RDBMS), Big Data Platforms blog, 1/10/13.
[4] Miha Ahrozovitz and Kuldip Pable, Hadoop.
[5] Y. Elababakh, Hadoop Framework, Spring.
[6] Brian Proffitt, The Real Reason Hadoop Is Such a Big Deal in Big Data, 20 May 2013.
[7] HDFS Architecture Guide, 8 Apr. 2013.
[8] Dr. Khalil E. Khalib and Aba Yagoub, Hadoop@UOIT.
[9] Eugene Ciurana, Apache Hadoop Deployment: A Blueprint for Reliable Distributed Computing.
[10] Ven Varma, Big Data and Analytics Hub, http://www.ibmbigdatahub.com/blog/running-hadoop-cloud.
[11] Private Cloud Team, Apache Hadoop on Rackspace Private Cloud.
[12] Hadoop Cluster Deployment: Step-by-Step Process, 28 July 2010, http://blog.ashwanthkumar.in/2010/07/hadoop-cluster-deployment-step-by-step.html.
[13] Chen Zhang, Hans De Sterck, Ashraf Aboulnaga, Haig Djambazian and Rob Sladek, Case Study of Scientific Data Processing on a Cloud Using Hadoop.
[14] Dell Cloudera Solution Reference Architecture v2.1.0, A Dell Reference Architecture Guide, November 2012.
[15] Sun Microsystems, Introduction to Cloud Computing Architecture, white paper, 1st edition, June 2009.
[16] Grant Ingersoll (Chief Scientist, LucidWorks) and Ted Dunning (Chief Application Architect, MapR), Crowd Sourcing Reflected Intelligence Using Search and Big Data.
[17] OpenLogic, Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud (presentation).
[18] http://www.thecloudtutorial.com/cloudtypes.html