
ISSN:

Dynamic Data Replication for HPC Analytics Applications in Hadoop

Ragupathi T 1, Sujaudeen N 2
1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India
2 Assistant Professor, Department of CSE, SSN College of Engineering, Chennai, India

Abstract: In distributed file systems, files are stored across multiple storage nodes. A file is partitioned into a number of chunks, which are allocated to different nodes. Clients access chunks transparently, irrespective of a chunk's location in the distributed file system. To analyze large volumes of data in a distributed file system, the Hadoop Distributed File System (HDFS) can be used. Map-reduce is a component of the Hadoop framework, and map-reduce tasks can run in parallel over the nodes that hold the required data chunks. In real Hadoop deployments, failure of a node or data chunk is inevitable. To improve data availability in distributed file systems, dynamic data chunk replication is needed, with a replication strategy that adapts to changes in user behavior. Data migration can also be used to improve data availability, but it incurs additional transfer time between nodes. The proposed system uses a dynamic data chunk replication strategy to improve data availability in high performance computing applications.

Keywords: HDFS, Replication, Map-Reduce

I Introduction: In many HPC (High Performance Computing) applications, such as global climate change modeling and high-energy physics, big data analysis has become an important component. The volume of data stored for such HPC applications is on the order of terabytes, and computation over this data takes a long time because the data is stored on different storage nodes. Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
Hadoop, an open-source software framework developed for reliable, scalable, distributed computing and storage, is used successfully by many companies such as Facebook and Amazon. Hadoop analyzes data on multiple nodes simultaneously. Files in Hadoop are split into block-sized chunks, and the default block size is 64 MB. The blocks of a single file are stored independently and are analyzed in two phases. The map phase filters the data relevant to the analysis out of the huge data set. The reduce phase accepts the filtered data from the map phase as input and refines it according to the requirements of the analysis. Despite Hadoop's high performance, not all the data a task needs resides on the same data node during processing, so multiple storage nodes must be accessed remotely. This remote data node access can be reduced by storing the data on the storage node that requires it. This paper proposes a technique for reducing remote storage node access by dynamically replicating data across storage nodes based on MapReduce access patterns.

II Related work:

Data-Centric Scheduler in HPC application: Saba Sehrish et al. [1] demonstrated a data-centric scheduler that improves the performance of HPC analytics MapReduce functions through data locality. Their approach includes features such as analyzing data without running multiple data pre-processing passes and reorganizing data in a data-intensive file system. Using MapReduce with Access Patterns (MRAP), they showed a 33 percent throughput improvement in one real application and 70 percent in an I/O kernel of another application. This approach greatly reduces the overhead of writing multiple MapReduce programs to pre-process data before its analysis.

Pre-fetching based dynamic data replication algorithm in grid: Nazanin Saadat et al. [2] proposed an algorithm named PDDRA, which is based on the assumption that members of a Virtual Organization have similar interests in files. The algorithm predicts the future needs of grid sites and pre-fetches a sequence of files to the requesting grid site, so that the next time the files are required at that site they are locally available. This popularity-based replica placement decreases data access time by dynamically creating replicas of popular files. The algorithm consists of three phases: phase 1 stores the file access patterns, phase 2 handles file requests and performs replication and pre-fetching, and phase 3 handles replacement. The technique decreased response time, access latency, and bandwidth consumption, and increased system performance in grid environments. The proposed work applies this data replication methodology to the Hadoop distributed file system to avoid the cost of using grid technology.

Survey of dynamic data replication strategies: Tehmina Amjad et al. [3] surveyed the issues in data replication and the different replication techniques, to find out which attributes are addressed and which are ignored in grid environments. The survey covers peer-to-peer architectures, multi-tier architectures with centralized and decentralized control, hybrid multi-tier sibling tree architectures with centralized and decentralized control, grid topology, single-tier and graph architectures, and QoS-aware data replication. The issues covered are the dynamic nature of grids, grid architecture, decision making, available storage space, and cost of replication. A tabular representation of all the parameters used by dynamic replication techniques shows their advantages and drawbacks in grids. The decision-making issues identified in this survey motivated applying the same concept to dynamic replication in Hadoop MapReduce.
Minimized remote access on MapReduce clusters: Prateek Tandon et al. [4] investigated a data placement policy called partitioned data placement, which reduces the number of remote data accesses and the associated performance degradation. They found that this policy can reduce the number of remote data accesses by as much as 86% when a job is restricted to execute on only one-third of the nodes in the cluster. Hence, for long-running jobs restricted to part of the cluster, partitioned data placement substantially reduces the remote access rate compared with random placement.

III Proposed System: In the architecture shown in Figure 1, files are stored on different data nodes in the system. All data nodes are connected through a name node. A map task is initiated at the name node, and the data it requires is accessed from the data nodes for processing. The most accessed block is replicated on the data node that requires it, to improve the overall performance of the map task.

Figure 1: System Architecture (diagram showing Client1, Client2, the Name Node, Data node1 to Data node4, data chunk blocks, replication of popularity data, and new replication of popularity data)

Name Node: It provides the infrastructure to store metadata in HDFS. The name node maintains a table that maps files to block-id locations in HDFS.

Data Node: It stores the files assigned to it by the name node. The data node periodically reports file status to the name node.

Job Tracker: The Job Tracker runs on the name node of HDFS. It is aware of block-id locations in HDFS and uses this location awareness to schedule map-reduce tasks on Task Trackers.

Task Tracker: A Task Tracker runs on a data node of HDFS. It executes the map-reduce tasks scheduled by the Job Tracker and reports task status back to it. If a Task Tracker fails, the Job Tracker assigns its map-reduce task to another Task Tracker for execution.
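The interaction between block locations, scheduling, and failure handling can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hadoop's actual Job Tracker logic: the function `pick_tracker`, the `block_locations` table, and the node names are hypothetical, and the fallback rule (use any live node when no node holding the block is reachable) is an assumption made for the example.

```python
from typing import Dict, List, Optional, Set


def pick_tracker(block_id: str,
                 block_locations: Dict[str, List[str]],
                 live_trackers: Set[str]) -> Optional[str]:
    """Choose a Task Tracker node for a map task that reads block_id."""
    # Prefer a live node that already stores the block (local access).
    for node in block_locations.get(block_id, []):
        if node in live_trackers:
            return node
    # Assumed fallback: any live node, which then reads the block remotely.
    return min(live_trackers) if live_trackers else None


# Hypothetical name-node table: block id -> data nodes holding a replica.
block_locations = {"A": ["datanode1", "datanode2"], "B": ["datanode3"]}
live = {"datanode1", "datanode2", "datanode3"}

first = pick_tracker("A", block_locations, live)   # "datanode1"
live.discard(first)                                # simulate tracker failure
retry = pick_tracker("A", block_locations, live)   # "datanode2"
```

With only one replica of a block, a failure forces the fallback branch and therefore a remote read, which is the cost that additional replicas of popular blocks are meant to avoid.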

IV DFD of Proposed System:

Figure 2: DFD of Proposed System

The popularity of data in the cluster is analyzed based on how frequently each file is accessed. The replication factor of the data is then determined from its popularity. A threshold value is set, and whenever the access frequency of a file exceeds the threshold, the file is replicated. The replicated data is placed on a suitable node in the cluster to improve the performance of MapReduce operations in Hadoop.

Popularity of Data: In the first phase, access records from all data nodes are collected to calculate the access frequency (AF), which gives preference to recently accessed data blocks in the cluster. AF is calculated over several time intervals. Assume $N_t$ is the number of time intervals, $F$ is the set of blocks accessed by the map-reduce function, and $a_f^t$ is the number of accesses to block $f \in F$ in time interval $t$. The access frequency of a block is calculated as

$$\mathrm{AF}(f) = \sum_{t=1}^{N_t} a_f^t \cdot 2^{-(N_t - t)}$$

Each data node maintains a detailed record for each file, stored in the form <filename, number of accesses>. Each data node sends its records to the name node at periodic intervals, and the name node aggregates the records from all data nodes.

For instance, suppose the records <A,8>, <B,5> and <C,2> indicate that blocks A, B and C have been accessed 8, 5 and 2 times respectively on data node1, and the records <A,6>, <B,2> and <C,2> indicate that blocks A, B and C have been accessed 6, 2 and 2 times respectively on data node2. The aggregated result is <A,14>, <B,7> and <C,4> in the first time interval and <A,7>, <B,4> and <C,2> in the second time interval. These two aggregate results are used to find the access frequencies of the data blocks in the cluster. The block with the highest access frequency is called the popularity block. Different weights are assigned to different time intervals. Figure 3 shows how the access frequency of a data block is calculated.
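The aggregation and weighting steps can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names `aggregate` and `access_frequency` are hypothetical, and the weight $2^{-(N_t - t)}$ is an assumption inferred from the partly illegible formula. It reproduces the reported values AF(A) = 14 and AF(C) = 4 for the sample counts, while for block B it yields 7.5 where the paper reports 7.

```python
from collections import Counter
from typing import Dict, List


def aggregate(node_records: List[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-node <block, accesses> records, as the name node would."""
    total: Counter = Counter()
    for record in node_records:
        total.update(record)  # Counter.update adds counts key by key
    return dict(total)


def access_frequency(intervals: List[Dict[str, int]]) -> Dict[str, float]:
    """Weighted access frequency; recent intervals count more.

    Assumed weight for interval t (1-based) out of n_t: 2 ** -(n_t - t).
    """
    n_t = len(intervals)
    af: Dict[str, float] = {}
    for t, counts in enumerate(intervals, start=1):
        weight = 2.0 ** -(n_t - t)
        for block, count in counts.items():
            af[block] = af.get(block, 0.0) + count * weight
    return af


# First interval: records from data node1 and data node2 (from the text).
interval1 = aggregate([{"A": 8, "B": 5, "C": 2}, {"A": 6, "B": 2, "C": 2}])
# Second interval: the text gives only the aggregated counts.
interval2 = {"A": 7, "B": 4, "C": 2}

af = access_frequency([interval1, interval2])
popularity_block = max(af, key=af.get)  # block with the highest AF
```

Halving the weight for each step back in time means a block must keep being accessed to stay popular, which matches the stated preference for recently accessed blocks.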
Figure 3: An experiment result for aggregation of access frequency

Access frequency of data blocks:

$$\mathrm{AF}(A) = 14 \cdot 2^{-1} + 7 \cdot 2^{0} = 14$$
$$\mathrm{AF}(B) = 7 \cdot 2^{-1} + 4 \cdot 2^{0} = 7.5 \quad \text{(reported as 7)}$$
$$\mathrm{AF}(C) = 4 \cdot 2^{-1} + 2 \cdot 2^{0} = 4$$

Threshold: The threshold indicates whether the current replication factor of a data block is sufficient, or whether the replication factor should be increased to improve the availability of the data block in the cluster. The threshold value changes dynamically according to the access frequencies of all data blocks in the cluster. The threshold (TH) is calculated from the replication factor of the data blocks (RF), the data block size (DS), and the access frequencies of the data blocks (AF):

TH = [equation missing]

If the AF of a data block is greater than the threshold, that data block is among those most frequently accessed by the map-reduce function and needs additional replicas to improve its availability. In the example, the access frequencies of data blocks A and B exceed the threshold, while that of data block C falls below it. Data blocks A and B therefore trigger an increase of the replication factor in the cluster.

Number of Replicas: The number of replicas is found by comparing the average AF of the popularity data block per time interval with the average access frequency of all data blocks. The number of replicas is calculated as

$$NR = \frac{\mathrm{AF}_{avg}(P)}{\mathrm{AF}_{avg}(F)}, \qquad \mathrm{AF}_{avg}(P) = \frac{\mathrm{AF}(P)}{N_t}, \qquad \mathrm{AF}_{avg}(F) = \frac{\sum_{f \in F} \mathrm{AF}(f)}{N_t \cdot |F|}$$

Data blocks A and B trigger an increase in replicas. The number of new replicas for each of data blocks A and B is 1, which improves the availability of these blocks to the map-reduce function in the cluster.

Placement of new Replicas: To make data blocks locally available to the map-reduce functions that use them, new replicas should be stored on the nodes where the blocks are accessed most often. Storage space is reserved on a data node only once that node has been chosen for the placement of new replicas.

V Results: Experiments were conducted in a lab environment in which every node had an i5 processor, 8 GB RAM, and a 500 GB HDD, and ran Ubuntu. Four systems were used, one acting as the name node and the other three as data nodes. A map task was run with its data available remotely and then locally, and the access times of the two runs were compared.

Table 1: Data block remotely accessed from Data node1.

Table 2: Data block locally accessed from Data node2.

VI. Conclusion and Future Work: The proposed system identifies the popularity of the data blocks corresponding to files in a Hadoop cluster. Based on the popularity of the data chunks, the chunks to be replicated are identified. New replicas are placed on suitable data nodes depending on the access frequency. The dynamic data replication strategy helps to solve data locality issues, thereby improving the availability and performance of Hadoop. The approach can also be extended to accommodate other applications that require high availability together with better performance.

References:

[1] Saba Sehrish, Grant Mackey, Pengju Shang, Jun Wang, John Bent, "Supporting HPC Analytics Applications with Access Patterns Using Data Restructuring and Data-Centric Scheduling Techniques in MapReduce", IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 1, January

[2] Nazanin Saadat and Amir Masoud Rahmani, "PDDRA: A new pre-fetching based dynamic data replication algorithm in data grids", Future Generation Computer Systems 28 (2012), November 2011

[3] Tehmina Amjad, Muhammad Sher and Ali Daud, "A survey of dynamic replication strategies for improving data availability in data grids", Future Generation Computer Systems, 28 (2011), July

[4] Prateek Tandon, Michael J. Cafarella and Thomas F. Wenisch, "Minimizing Remote Access in MapReduce Clusters", IPDPS Workshops, Page , IEEE (2013).

[5] Da-Wei Sun, Gui-Ran Chang, Shang Gao, Li-Zhong Jin and Xing-Wei Wang, "Modeling a Dynamic Data Replication Strategy to Increase System Availability in Cloud Computing Environments", Journal of Computer Science and Technology 27 (2): March

[6] Houda Lamehamedi, Boleslaw Szymanski, Zujun Shentu and Ewa Deelman, "Data Replication Strategies in Grid Environments", IEEE Computer Society Press, October

[7] Qiu Zhil and Lin Zhao-wen, "Research of Hadoop-based data flow Management System", m/science/journal/ , January