International Journal of Emerging Technology & Research


High Performance Clustering on Large Scale Dataset in a Multi Node Environment Based on Map-Reduce and Hadoop

Anusha Vasudevan (1), Swetha.M (2)
(1, 2) Computer Science and Engineering, JCT College of Engineering and Technology, Coimbatore, Tamilnadu

Abstract--The amount of data in our world has been exploding, and analyzing large data sets, so-called big data, will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. Big data refers to datasets that have grown too large to be handled by traditional methods, that is, to be captured, stored, and processed in a tolerable amount of time. Apache Hadoop is an open-source software framework for storing and processing large-scale datasets. It works with the Map-Reduce software framework, which makes it easy to write applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Cluster analysis is an unsupervised learning task that consists of classifying objects into groups. This paper shows that a K-means clustering algorithm built on the Map-Reduce framework can obtain higher performance when handling large-scale automatic document classification in a multi-node environment. It reduces outlier data and enhances the speed of the system.

Key terms: Big data, Hadoop, Map-Reduce, K-means Clustering, HDFS.

1. INTRODUCTION

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications [1]. Every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years.
Although the term big data was once applied to the concept of data warehouses, it now refers to large-scale processing architectures that focus on capacity, throughput, and generality of processing. Big data systems use Apache Hadoop as a large-scale distributed batch processing infrastructure [2], which includes the Hadoop Distributed File System (HDFS) and Map-Reduce. Present systems, built on traditional database management systems, data warehousing, and data mining, fall short on processing time, privacy, and security. Hence there is an urgent need for a system that can handle big data sets, up to zettabytes of data.

Copyright reserved by IJETR (Impact Factor: 0.997)

2. APACHE HADOOP

Apache Hadoop is a distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel. The Hadoop framework is written in Java, and its core components are:

- Hadoop Distributed File System: stores large data sets
- Map-Reduce: processes large data sets

The major difference between a traditional RDBMS and Hadoop is that an RDBMS is used in transactional systems to report and archive data, whereas Hadoop is an approach to storing huge amounts of data in a distributed file system and processing it.

3. HADOOP DISTRIBUTED FILE SYSTEM

HDFS is an Apache Software Foundation project and a subproject of the Apache Hadoop project. Hadoop is designed for storing very large amounts of data, on the order of terabytes to petabytes and beyond, and uses HDFS as its storage system. HDFS lets us store and access data files as one seamless file system. Access to data files is handled in a streaming manner, meaning that applications or commands are executed directly using the Map-Reduce processing model. HDFS is fault tolerant and provides high-throughput access to large data sets.

3.1 HDFS ARCHITECTURE

HDFS consists of interconnected clusters of nodes where files and directories reside. An HDFS cluster has a single Name Node, which manages the file system namespace and regulates client access to files. In addition, Data Nodes store data as blocks within files [8]. Fig. 1 shows the HDFS architecture.

3.2 NAME NODE

The Name Node is the most vital of the Hadoop daemons. It is the master of HDFS and directs the slave Data Node daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS: it keeps track of how files are broken down into blocks, which nodes store those blocks, and the overall health of the distributed file system. Its work is memory and I/O intensive.
To keep the workload on its machine low, the Name Node typically does not store any user data or perform any computations for a Map-Reduce program.

3.3 DATA NODE

Each slave machine in the cluster hosts a Data Node daemon, which performs the grunt work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system. When a client reads or writes an HDFS file, the file is broken into blocks, and the Name Node tells the client which Data Node each block resides on. The client then communicates directly with the Data Node daemons to process the local files corresponding to the blocks. A Data Node may also communicate with other Data Nodes to replicate its data blocks for redundancy.

Upon initialization, each Data Node informs the Name Node of the blocks it is currently storing. After this mapping is complete, the Data Nodes continually poll the Name Node to provide information about local changes and to receive instructions to create, move, or delete blocks on the local disk. The Data Nodes thus provide a backup store of the blocks and constantly report to the Name Node to keep the metadata current. This backup store ensures that if any one Data Node crashes or becomes inaccessible over the network, the files can still be read.

Fig. 1: HDFS architecture (HDFS client, Name Node, Data Nodes)

3.4 SECONDARY NAME NODE

The Secondary Name Node (SNN) is an assistant daemon for monitoring the state of the HDFS cluster. Like the Name Node, each cluster has one SNN, and it typically resides on its own machine; no Data Node or Task Tracker daemons run on the same server. Its function is as follows [13]. The Name Node is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data. Nevertheless, a Name Node failure requires human intervention to reconfigure the cluster to use the SNN as the primary Name Node.

3.5 JOB TRACKER

The Job Tracker daemon is the liaison between your application and Hadoop. Once you submit your code to the cluster, the Job Tracker determines the execution plan by deciding which files to process, assigns nodes to different tasks, and monitors all tasks as they run. If a task fails, the Job Tracker automatically relaunches it, possibly on a different node, up to a predefined limit of retries. There is only one Job Tracker daemon per Hadoop cluster. It typically runs on a server that acts as a master node of the cluster.

3.6 TASK TRACKER

The Job Tracker is the master overseeing the overall execution of a Map-Reduce job [6], and the Task Trackers manage the execution of individual tasks on each slave node. Each Task Tracker is responsible for executing the individual tasks that the Job Tracker assigns. One responsibility of the Task Tracker is to communicate constantly with the Job Tracker. If the Job Tracker fails to receive a heartbeat from a Task Tracker within a specified amount of time, it assumes the Task Tracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.

4. MAHOUT

Mahout is a project of the Apache Software Foundation that is mainly used for clustering. It provides common math operations and primitive Java collections, and it makes clustering data straightforward. Here we use Subversion and Maven to set it up. To cluster, the data is first copied into HDFS and converted into a sequence file; it then undergoes syntax analysis, and finally k-means clustering is applied. The clustering runs several Map-Reduce iterations and then produces the clustered output.

5. MAP-REDUCE

Hadoop limits the amount of communication that can take place between processes, as each individual record is processed by a task in isolation from the others. Hadoop will not run just any program and distribute it across a cluster. In Map-Reduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different Mappers can be merged [7]. Fig. 2 shows the Map-Reduce operation.

Fig. 2: Map-Reduce operation (input data split into part-1 ... part-N, map instances, reduce, output data)

5.1 MAP FUNCTION

The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other. If the application is doing a word count, the map function breaks each line into words and outputs a key/value pair for each word. Each output pair contains the word as the key and the number of instances of that word in the line as the value.

5.2 REDUCE FUNCTION

The framework calls the application's Reduce function once for each unique key, in sorted order. The Reduce function can iterate through the values associated with that key and produce zero or more outputs. In the word count example, the Reduce function takes the input values, sums them, and generates a single output of the word and the final sum.

6. CLUSTERING

A cluster is a group of the same or similar elements gathered or occurring closely together [16]. Clustering is one of the most popular tools for data exploration and data organization and has been widely used in almost every scientific discipline that collects data. Given the exponential growth in data generation (estimated to be over 35 trillion gigabytes by the year 2020), clustering is receiving renewed interest and use in applications such as social networks, image retrieval, web search, and gene expression analysis. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.

6.1 K-MEANS CLUSTERING

k-means clustering is a method of vector quantization. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [10], which serves as a prototype of the cluster.

6.2 K-MEANS ALGORITHM

The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to a cluster based on the observation's proximity to the mean of the cluster [4]. Each cluster's mean is then recomputed and the process begins again. Fig. 3 shows the flow of the k-means algorithm.

Fig. 3: Flow chart of the k-means algorithm (start; number of clusters; centroids; distance of objects to centroids; grouping based on minimum distance; repeat until no object moves; end)

7. INSTALLATION OF MULTINODE HADOOP CLUSTER AND WORKING OF K-MEANS CLUSTERING ALGORITHM

Step 1: As a prerequisite for creating a multi-node Hadoop cluster, two local single-node Hadoop clusters have to be configured. This is done on Ubuntu with the help of Oracle VirtualBox, and the configuration is carried out using shell commands.
During the initial stages of configuration, all the necessary software, such as Python and Java, is installed, and users and groups are created accordingly [9]. Hadoop then performs the Map-Reduce task, which generates key/value pairs internally [16]. Fig. 4 shows the execution of the Map-Reduce operation.
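The word count behaviour described in Sections 5.1 and 5.2 can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_line`, `shuffle`, `reduce_word`, `word_count`) are hypothetical, and the `shuffle` helper only imitates the grouping and key sorting that the framework performs between the map and reduce phases.

```python
from collections import Counter, defaultdict


def map_line(line):
    """Map step: emit (word, occurrences in this line) pairs for one line."""
    return list(Counter(line.lower().split()).items())


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order.

    Stands in for the grouping the framework does between map and reduce.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_word(word, counts):
    """Reduce step: sum the partial counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_word(word, counts) for word, counts in shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data nodes store data"]))
```

Because each call to `map_line` looks at one line only, the map calls could run on different nodes; only the grouped values for a key need to meet in a single `reduce_word` call.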

International Journal of Emerging Technology & Research, Volume 1, Issue 4, May-June, 2014 (ISSN (E): ISSN (P):)

Fig. 4: Map-Reduce operation

Step 2: After the Map-Reduce task has run, its output is copied to HDFS and then stored locally.

Step 3: Once the individual systems are configured, they are combined to form a multi-node Hadoop cluster. The IP addresses of both systems are configured so that one node becomes a dedicated master and the other becomes the slave [14]. The hosts file used to configure master and slave is opened with the following command:

$ sudo gedit /etc/hosts

The identification of master and slave is shown in Fig. 5 and Fig. 6 respectively. The shell command to list the Java processes on master and slave is:

$ jps

Fig. 5: Master JPS

Fig. 6: Slave JPS

Step 4: Once we have ensured that the Map-Reduce operation is working properly, we proceed to install Mahout. Here we use Subversion and Maven.

Step 5: For clustering, the documents are first copied from the local file system to HDFS, and the text documents to be clustered are then converted to a sequence file. Fig. 7 shows the conversion from text to sequence file.

Fig. 7: Conversion to sequence file
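Once the documents are available as vectors, clustering iterates the k-means loop of Section 6.2. The following is a minimal single-machine sketch of that loop, not the Mahout implementation: the function names are hypothetical, Euclidean distance is an assumption made for simplicity (the `distance` argument can be swapped), and the initial centroids are passed in explicitly so that the result is deterministic.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, max_iter=10, distance=euclidean):
    """Iterate assignment and mean update until assignments stop changing.

    points    -- list of numeric tuples (the observations)
    centroids -- initial cluster means; their count is k
    max_iter  -- upper bound on the number of iterations
    """
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point moved, so the clustering has converged
        assignment = new_assignment
        # Update step: recompute each cluster mean from its current members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, assignment


if __name__ == "__main__":
    data = [(1, 1), (1, 2), (8, 8), (9, 8)]
    print(kmeans(data, centroids=[(1, 1), (8, 8)]))
```

In a Map-Reduce setting the assignment step maps naturally onto mappers (each point is handled in isolation) and the mean update onto reducers (one per cluster), which is why each k-means iteration can be run as one Map-Reduce pass.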

Step 6: Once the documents are converted into a sequence file, they are clustered using the k-means clustering algorithm:

$ mahout kmeans -i /user/sample/tfidf-vectors/ -o /user/hduser/sam_out -c /user/hduser/sam_m -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering -cl

Fig. 8 shows the clustering of the files.

Fig. 8: Clustering of the input files

8. CONCLUSION

Big data, used in the singular, refers to a collection of data sets so large and complex that it is impossible to process them with the usual databases. Companies pursue big data because it can be revelatory in spotting business trends, improving research quality, and gaining insights in a variety of fields, from IT to medicine to law enforcement and everything in between and beyond. Hence there is a need for highly scalable parallel data processing platforms such as Hadoop, in which Map-Reduce is a framework for programming clusters of commodity computers to perform large-scale processing of data stored in HDFS, here with a parallel algorithm, K-means. In this paper we take advantage of the parallelism of Map-Reduce to design a parallel K-means clustering algorithm for a multi-node environment. This algorithm can automatically cluster massive data, making full use of the performance of the multi-node Hadoop cluster, and makes big data analysis an easier task.

9. REFERENCES

[1] rial.html
[2]
[3] nition/hadoop-cluster
[4] Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu, Clustering by Pattern Similarity in Large Data Sets
[5] 3/iCML03/papers/ramos.pdf
[6] Vishal S. Patil, Pravin D. Soni, Hadoop Skeleton & Fault Tolerance in Hadoop Clusters
[7] Hyeokju Lee, Joon Her, Sung-Ryul Kim, Implementation of a Large-scalable Social Data Analysis System based on Map-Reduce
[8]
[9] p+clusters+and+the+network
[10] Anil K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (2010)
[11] Dan Pelleg, Andrew Moore, X-means: Extending K-means with Efficient Estimation of the Number of Clusters
[12]
[13] ule1.html
[14]
[15]
[16]
[17] Clustering_Korea_Sept12.pdf