, pp. 177-186
http://dx.doi.org/10.14257/ijseia.2015.9.12.16

Efficient Data Replication Scheme based on Hadoop Distributed File System

Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3*
1 Division of Supercomputing, Korea Institute of Science and Technology Information, Korea
2 Dept. of Computer Science, Korea National Open University, Korea
3 Division of General Education, Seokyeong University, Korea
1 jungha07@kisti.re.kr, 2 jaehwachung@knou.ac.kr, 3 daelee@skuniv.ac.kr

Abstract

The Hadoop Distributed File System (HDFS), designed to store huge data sets reliably, has been widely used for processing massive-scale data in parallel. In HDFS, the data locality problem is one of the critical problems that degrade file system performance. To solve it, we propose an efficient data replication scheme based on access count prediction in the Hadoop framework. From the history of previous data accesses, the proposed scheme predicts the next access count of each data file using Lagrange's interpolation. It then determines the replication factor from the predicted access count and selectively decides whether to generate a new replica or to use the loaded data as a cache. As a result, the proposed scheme improves data locality. In the performance evaluation, the proposed scheme is compared with the default replication setting of Hadoop and reduces the task completion time in the map phase by 8.9% on average. Regarding data locality, the proposed scheme increases node locality by 6.6% and decreases rack and rack-off locality by 38.9% and 56.5%, respectively.

Keywords: Hadoop, Data locality, Access Prediction, Data Replication, Data Placement

1. Introduction

Driven by the need for an efficient programming model for large-scale data in distributed computing environments, MapReduce was developed by Google [1].
Hadoop is an open-source implementation of MapReduce. The Hadoop Distributed File System (HDFS), developed by the Apache Foundation, provides the power of high-speed computing clusters, large-scale storage, and high performance for big data [1-5]. HDFS provides high availability and fault tolerance through mechanisms such as replication. To provide data locality, Hadoop tries to automatically collocate the data with the computing node: it schedules Map tasks so that the data reside on the same node or the same rack. This data locality is a principal factor in Hadoop's performance [1, 6]. Under the Hadoop scheduling policy, the data locality problem occurs when the assigned node has to load a data block from another node. Data locality in Hadoop refers to the distance between the data and the assigned node. There are three types of data locality in Hadoop:

(1) Node locality: the data to be processed are stored in the local storage of the assigned node;

* Corresponding Author

ISSN: 1738-9984 IJSEIA
Copyright © 2015 SERSC
(2) Rack locality: the data to be processed are not stored in the local storage but on another node within the same rack;

(3) Rack-off locality: the data to be processed are stored neither in the local storage nor on a node within the same rack, but on a node in a different rack.

In this paper, we propose an efficient data replication scheme for the Hadoop framework based on access count prediction. The proposed scheme focuses on improving data locality in the map phase and on reducing the data transfer overhead that increases total execution time. The scheme efficiently determines when to increase the replication factor and avoids unnecessary data replication. The contributions of this paper are as follows. First, the proposed predictor optimizes and maintains the replication factor effectively. Second, the proposed data replication algorithm minimizes the data transfer load between racks. Third, the improved data locality reduces the processing time of MapReduce jobs. The rest of this paper is organized as follows. Section 2 discusses previous work on data locality in the MapReduce framework and introduces the data locality problem. The proposed data replication scheme based on access count prediction is presented in Section 3. Section 4 shows the performance evaluation of the proposed scheme. Finally, Section 5 gives our conclusions.

2. Related Works

2.1. Previous Works

Several previous studies have examined data replication in HDFS. To improve fault tolerance, [6-9] use data replication in HDFS; they are mainly focused on overcoming unexpected failures. More recently, some research [10-12] has focused on improving data locality through data replication in Hadoop, and scheduling methods [13-14] have also been proposed to improve data locality. Table 1 summarizes the features of these data replication schemes and scheduling methods.

Table 1.
Previous Data Replication Schemes and Scheduling Methods

  Category                  | Scheme                              | Technique                  | Weakness
  Data replication schemes  | [10] dynamic data replication       | access patterns            | removal of replicated data
                            | [11] data placement                 | balancing for requirements | depends on the application
                            | [12] data prefetching               | prediction by log          | depends on log data
  Scheduling methods        | [13] data-locality-aware scheduling | waiting time estimation    | data transfer time
                            | [14] delay scheduling               | based on delay             | -

2.2. Data Locality Problem

This section describes the data locality problem and the types of data locality in Hadoop. Data locality is related to the distance between the data and the processing node. So, if the
distance between the data and the node is shorter, the data locality is better. Figure 1 shows the three types of data locality in Hadoop: node locality, rack locality, and rack-off locality.

Figure 1. Example of Data Locality

The data locality problem can be defined as a situation in which a task is scheduled with rack or rack-off locality, which can cause poor performance. The overhead of rack-off locality is greater than that of rack locality. To prevent the data locality problem, we propose an efficient data replication scheme that uses prediction based on the access counts of data files, together with a replica placement algorithm that reduces the occurrence of rack and rack-off locality.

3. Efficient Data Replication Scheme

The overall diagram of MapReduce with the proposed scheme is shown in Figure 2. The proposed modules are marked with a red dotted rectangle in HDFS.

3.1. Access Count Prediction

Figure 2. Diagram of Efficient Data Replication

The basic idea is to determine a different replication factor per data file. More replication does not always guarantee better data locality; however, the probability of node locality becomes higher when there are enough replicas to serve the accesses. To determine an efficient replication factor, a prediction method is required to forecast the next access counts of data. To accomplish this, the change of access counts over time should be expressed as a mathematical formula.
However, because data are accessed randomly, a constant function is not suitable. Therefore, we apply Lagrange's interpolation with a polynomial expression to extract the predicted access count of data. The mathematical formula is given below:

  P(x) = sum_{i=0}^{N-1} f_i * prod_{j=0, j!=i}^{N-1} (x - x_j) / (x_i - x_j)    (1)

In equation (1), N is the number of points, x_i is the i-th point, and f_i is the function value at x_i. To calculate the predicted access count, x is substituted by the time t at which an access occurred, and f by the access count at t. Table 2 shows the proposed access count prediction algorithm. In this algorithm, t_i is the time at which the i-th access is made, avg is the average time interval between accesses, and AccessList[i] is the access count at t_i.

Table 2. Access Count Prediction Algorithm

AccessPrediction(AccessList[ ])
  /* Step 1. Initialization of the variables */
  int sum = 0, Threshold = 0, temp = 0;
  float tempx = 1.0;

  /* Step 2. Calculation of the average time interval */
  for (i = 0; i < n; i++) {        // n is the number of time stamps.
    temp = t_{i+1} - t_i;
    sum = sum + temp;
  }
  avg = sum / n;

  /* Step 3. Prediction of the next access time */
  t_next = t_{n-1} + avg;

  /* Step 4. Calculation of the number of future accesses */
  for (i = 0; i < n; i++) {
    tempx = 1.0;
    for (j = 0; j < n; j++) {
      if (i != j) {
        temp = (t_next - t_j) / (t_i - t_j);
        tempx = tempx * temp;
      }
    }
    temp = tempx * AccessList[i];
    Threshold = Threshold + temp;
  }
  return Threshold;
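The predictor of Table 2 can be written compactly in Python. This is a minimal sketch under our own naming (the function and parameter names are not from the paper): it estimates the next access time from the average inter-access interval, then evaluates the Lagrange polynomial through the (time, count) samples at that time.

```python
def predict_next_access_count(times, counts):
    """Predict the access count at the next expected access time
    using Lagrange extrapolation over (time, count) samples."""
    n = len(times)
    # Steps 2-3: average interval between accesses, then next access time
    avg = (times[-1] - times[0]) / (n - 1)
    t_next = times[-1] + avg
    # Step 4: Lagrange interpolation polynomial evaluated at t_next
    threshold = 0.0
    for i in range(n):
        basis = 1.0
        for j in range(n):
            if i != j:
                basis *= (t_next - times[j]) / (times[i] - times[j])
        threshold += basis * counts[i]
    return threshold
```

For example, with accesses at times 1, 2, 3 whose counts grow linearly as 2, 4, 6, the polynomial is linear and the predicted count at the next expected time t = 4 is 8.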
3.2. Efficient Data Replication and Replica Placement

This subsection describes the efficient data replication algorithm based on access count prediction. The proposed algorithm compares the current replication factor of a file with the predicted access count. Furthermore, we present the replica placement algorithm, which effectively reduces the number of tasks with rack or rack-off locality. Table 3 shows the efficient data replication algorithm based on access count prediction. F_i denotes the i-th file, Demand_i the demand count of the i-th file, and replica_i the replication factor of the i-th file.

Table 3. Efficient Data Replication Algorithm

AdaptiveDataReplication( )
  /* Step 1. Requesting a task */
  if (# of TaskTracker's idle slots >= 1) {
    request a task from the JobTracker;
  }

  /* Step 2. Checking the data locality of tasks */
  if (# of tasks with node locality >= 1) {
    assign one of them;
  } else if (# of tasks with rack locality >= 1) {
    assign one of them;
  } else {
    assign a task with rack-off locality;
  }

  /* Step 3. Increasing the number of accesses */
  if (the assignment in Step 2 is the first assignment for the job) {
    for (i = 0; i < F; i++) {      // F is the number of job files.
      Access_i = Access_i + 1;     // Access_i is the access count of the i-th file.
    }
  }

  /* Step 4. Obtaining the value of Threshold */
  Threshold = AccessPrediction(AccessList);  // AccessList includes Access_0 to Access_n.

  /* Step 5. Creating the caches or replicas */
  for (i = 0; i < F; i++) {
    if (replica_i >= Threshold) {
      use the loaded data as a cache when the task has no node locality;
    } else {
      create a replica of the corresponding file;
    }
  }

Table 4 shows the replica placement algorithm for data locality. In this algorithm, Rack_i is the i-th rack, Rack_selected is the currently selected rack, Node_inturn is the currently selected node, and Replica_n is the n-th replica.
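The decision step of Table 3 (Step 5) reduces to a simple comparison per file. The following is a minimal sketch under assumed data structures of our own (a dict from file name to replication factor, which is not from the paper): files whose replication factor already meets the predicted threshold reuse remotely loaded data as a cache, while under-replicated files get a new replica.

```python
def decide_replication(replicas, threshold):
    """replicas: {file_name: current replication factor}.
    Returns {file_name: 'cache' or 'replicate'} per Step 5 of Table 3."""
    decisions = {}
    for name, factor in replicas.items():
        if factor >= threshold:
            # enough replicas for the predicted demand:
            # reuse remotely loaded data as a cache
            decisions[name] = 'cache'
        else:
            # too few replicas for the predicted demand: add one
            decisions[name] = 'replicate'
    return decisions
```

This is what avoids unnecessary replication: a file is only replicated further when its predicted access count exceeds its current replication factor.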
The proposed replica placement algorithm focuses on improving data locality, especially by reducing rack-off locality.
Table 4. Replica Placement Algorithm

ReplicaPlacement( )
  /* Step 1. Selection of a rack to store the replica */
  for (Rack_i in the circular linked list of racks) {   // R is the number of racks.
    if (Replica_n does not exist in Rack_i) {
      Rack_selected = Rack_i;
      goto Step 2;
    }
  }
  if (all the racks have Replica_n) {
    Rack_selected = the first rack to be searched;
  }

  /* Step 2. Selection of a node to store the replica */
  select Node_inturn from the circular linked list of nodes on Rack_selected;

  /* Step 3. Storing the replica */
  store Replica_n on Node_inturn of Rack_selected;
  register the information of Replica_n with the NameNode;

4. Performance Evaluation

4.1. Evaluation Environment

In the evaluation, the Hadoop cluster is composed of one master node and eight slave nodes running Hadoop-0.20.2. Each node consists of an Intel Core i5 CPU and 8GB of RAM. Within a single rack, nodes are connected by Gigabit Ethernet switches; between racks, Fast Ethernet routers are used. We run the wordcount application with varying input data sizes: 1.3GB, 1.9GB, 2.5GB, 3.2GB, and 4.4GB. Based on the logs of a real job trace, we evaluate our data replication scheme against the default replication setting of Hadoop. Table 5 shows the configuration of the simulation.

Table 5. Configurations of Simulation

Cluster configuration
  Number of master nodes: 1
  Number of slave nodes: 100
  Number of racks: 3
HDFS configuration
  Number of replicas: 3
  Block size: 64MB
  File1 size: 3.2GB
  File2 size: 1.9GB
  File3 size: 1.3GB
  File4 size: 2.5GB
  File5 size: 4.4GB
Hadoop configuration
  Scheduler: Fair scheduler
  Number of concurrent jobs: 6
  Number of shared files: 5
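The rack-selection logic of Table 4 can be sketched as follows. This is a minimal model under assumed data structures of our own (a list of rack dicts with a replica set and a round-robin cursor; none of these names are from the paper): scan the racks in circular order, pick the first rack not yet holding the replica (falling back to the first rack searched if all racks hold it), then pick that rack's in-turn node.

```python
def place_replica(replica_id, racks, start=0):
    """racks: list of dicts like {'nodes': [...], 'replicas': set(), 'next': 0}.
    Returns (rack index, node) chosen for replica_id."""
    n = len(racks)
    order = [(start + k) % n for k in range(n)]              # circular scan of racks
    selected = next((i for i in order
                     if replica_id not in racks[i]['replicas']),
                    order[0])                                # all racks have it: fall back
    rack = racks[selected]
    node = rack['nodes'][rack['next'] % len(rack['nodes'])]  # in-turn node on the rack
    rack['next'] += 1                                        # advance round-robin cursor
    rack['replicas'].add(replica_id)
    return selected, node
```

Scanning racks before nodes is what spreads replicas of the same block across racks first, which is the property that shrinks the rack-off locality cases.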
4.2. Performance Results

Figure 3 shows a comparison of the map phase completion time between the proposed scheme and the Hadoop default. For the 6 jobs, 216 map tasks are spawned. The average completion time of the map phase with the Hadoop default is 90.5 seconds, whereas that of the proposed scheme is 81 seconds, an 8.9% performance improvement.

Figure 3. Comparison of the Completion Time of the Map Phase between the Proposed Scheme and the Hadoop Default

Figure 4 shows the number of map tasks with node, rack, and rack-off locality. In comparison with the Hadoop default, the proposed scheme increases node locality by about 4.5% and decreases rack and rack-off locality by about 11.6% and 20.9%, respectively.

Figure 4. Comparison of Data Locality between the Proposed Scheme and the Hadoop Default

Figures 5 and 6 show the number of map tasks with each type of data locality for varying numbers of slave nodes. The largest gain in node locality occurs when the number of slave nodes is 130: compared with the Hadoop default, node locality increases by about 6.6%, and rack and rack-off locality decrease by about 38.9% and 56.5%, respectively.
Figure 5. Comparison of Node Locality for Varying Numbers of Nodes

Figure 6. Comparison of Rack/Rack-off Locality for Varying Numbers of Nodes

Figure 7 shows a comparison of the completion time of the map phase for varying numbers of shared files. The numbers of map tasks are 52 for 1 shared file, 83 for 2 shared files, 104 for 3 shared files, 144 for 4 shared files, and 216 for 5 shared files.

Figure 7. Comparison of the Completion Time of the Map Phase for Varying Numbers of Files
5. Conclusion

To solve the data locality problem, we proposed an efficient data replication scheme for the Hadoop framework. The proposed scheme aims at improving data locality in the map phase and reducing the total processing time. By predicting the access counts of each data file, we optimize the replication factor per file; the scheme then determines whether to generate a new replica or to use the loaded data as a cache. The performance evaluation demonstrates three major advantages of the proposed scheme. First, it optimizes and maintains the replication factor effectively. Second, it minimizes the data transfer load between racks. Finally, it reduces the processing time of MapReduce jobs.

Acknowledgments

This research was supported by Seokyeong University in 2012.

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, (2008), pp. 107-113.
[2] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., (2012).
[3] D. Borthakur, "HDFS architecture guide", Hadoop Apache Project, http://hadoop.apache.org/common/docs/current/hdfs_design.pdf, (2008).
[4] A. Thomasian and J. Menon, "RAID5 performance with distributed sparing", IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 6, (1997), pp. 640-657.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, (2008), pp. 107-113.
[6] "Hadoop" [Online]. Available: http://hadoop.apache.org/.
[7] K. Shvachko, "The Hadoop distributed file system", 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, (2010).
[8] S. Mahadev, "A survey of distributed file systems", Annual Review of Computer Science, vol. 4, no. 1, (1990), pp. 73-104.
[9] Q.
Wei, "CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster", 2010 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, (2010).
[10] J. Xiong, "Improving data availability for a cluster file system through replication", 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), IEEE, (2008).
[11] C. L. Abad, Y. Lu and R. H. Campbell, "DARE: Adaptive data replication for efficient cluster scheduling", 2011 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, (2011).
[12] L. M. Khanli, A. Isazadeh and T. N. Shishavan, "PHFS: A dynamic replication method to decrease access latency in the multi-tier data grid", Future Generation Computer Systems, vol. 27, no. 3, (2011), pp. 233-244.
[13] S. Seo, "HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment", 2009 IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), IEEE, (2009).
[14] X. Zhang, "An effective data locality aware task scheduling method for MapReduce framework in heterogeneous environments", 2011 International Conference on Cloud and Service Computing (CSC), IEEE, (2011).
[15] M. Zaharia, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling", Proceedings of the 5th European Conference on Computer Systems, ACM, (2010).
[16] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., (2012).

Authors

Jungha Lee received her B.E. in Information and Communication Engineering from Seokyeong University, Korea, in 2011. She received her M.S. in Computer Education from Korea University, Korea, in 2013. Since 2013, she has been a researcher in the Supercomputing Service Center, Korea Institute of Science and Technology Information (KISTI). Her research interests lie in distributed file systems, high throughput computing, and cloud computing.
Jaehwa Chung is an assistant professor in the Dept. of Computer Science at Korea National Open University. He received his M.S. and Ph.D. degrees from the Dept. of Computer Science Education at Korea University, Korea. His research interests include spatial query and indexing, spatio-temporal databases, mobile data management, location-based services, Spark, WSNs, and mobile data mining.

Daewon Lee received his B.S. in the Division of Electricity and Electronic Engineering from Soonchunhyang University, Asan, ChungNam, Korea, in 2001. He received his M.E. and Ph.D. degrees in Computer Science Education from Korea University, Seoul, Korea, in 2003 and 2009. He is currently a full-time lecturer in the Division of General Education at Seokyeong University in Korea. His research interests are in IoT, mobile computing, distributed computing, cloud computing, and fault-tolerant systems.