ANALYSING THE FEATURES OF JAVA AND MAP/REDUCE ON HADOOP

Livjeet Kaur, Research Student, Department of Computer Science, Punjabi University, Patiala, India

Abstract
In the present study, we have compared and analyzed the features of Java and Map/Reduce on Hadoop. Two algorithms, Maximum Finder and Anagram Finder, have been used as the basis of the study. These programs were run on the same data sizes to analyze the performance improvement on different criteria such as storage, processing, networking and efficiency. The conclusion drawn from the results shows an improvement in execution time when using the optimized Map/Reduce algorithm, and a further improvement of the optimized Map/Reduce program over the general one.

1. Introduction
There are a lot of applications generating huge amounts of data daily. Many companies have collected data in multiple forms, and a huge wealth of data gets accumulated with every passing day. It could be of great benefit to many stakeholders to dig out the useful information present in this ocean of data. Hadoop is made to handle these problems. Hadoop is the Apache Software Foundation's open source, Java-based implementation of the Map/Reduce framework. Hadoop provides the tools for processing vast amounts of data using the Map/Reduce framework and, additionally, implements the Hadoop Distributed File System (HDFS). Map/Reduce is Google's programming framework which performs parallel computations on large datasets, for example, web request logs and crawled documents. The huge amounts of data need to be distributed over a number of machines (nodes). Storage resources are contributed equally by each participating node, and these nodes are connected under the same distributed file system. Furthermore, each machine performs the same computations over its locally stored data, which results in parallel, large-scale distributed processing.
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. To illustrate this with an example, take the case of the Oil & Gas industry. Service companies like Schlumberger, BHI and Halliburton perform the drilling for such industries. Drilling data is collected by these companies through sensors placed on drilling bits and platforms, and is then made available on their servers. The real-time drilling data indicates the drilling status, and applying reasoning algorithms on the historical data can provide useful information to the operators. But as time goes by, the data accumulates to an enormous size. When the data set reaches several gigabytes, it becomes very time consuming or practically impossible to do any reasoning on it. A Map/Reduce system is a valuable framework in these kinds of situations.

The motivation for this study is that data is growing at an exponential rate, whereas the speed at which data can be read from or written to disk is increasing only gradually. In 1990, a typical drive stored 1370 MB and had a transfer rate of about 4.4 MB/s, so the whole disk could be read in about 5 minutes. Today a 1 TB disk is the norm, with a transfer rate of about 100 MB/s, so reading all the data from the disk takes more than 2 hours.

In the present work, we have analyzed the features of Java and Map/Reduce on the Hadoop platform. In conventional programming languages like Java, data is read sequentially, so processing speed depends on the disk transfer rate. Map/Reduce systems, on the other hand, operate on a distributed file system which supports parallel processing. We have therefore measured the performance of two algorithms, Maximum Finder and Anagram Finder. When these programs were run on the same data sizes in Java and Map/Reduce, the performance improvement was analyzed on different criteria such as storage, processing, networking and efficiency.

2. METHODOLOGY
In our work, the data used for processing was in megabytes because of limitations of the hardware available to us. It was not feasible to use data in terabytes, petabytes or exabytes because the single node in our experimental setup had only 8 GB of RAM and a 1 TB hard disk, with an AMD processor. We installed VMware Player on this node for virtualization; the virtual machine ran the CentOS operating system, and Hadoop was installed on it. The networking concepts were analyzed on the local host. We have utilized two frameworks, Map/Reduce on the Hadoop platform and plain Java, for comparing the performance. Hadoop was run in pseudo-distributed mode, and this single-node Hadoop cluster has been tested under two scenarios:
1. Time Constraint
2. Data Constraint

For the first scenario, an optimized Java algorithm and worst-case and best-case Map/Reduce algorithms were written, and a comparative study was performed to check how much faster Map/Reduce is than the Java program. For the second scenario, these algorithms were tested with different amounts of data to check the execution time taken by Java and Map/Reduce, which reflects the CPU utilization. After getting results from both scenarios, the results were analyzed to evaluate the processing power of the Map/Reduce best case and worst case with respect to the optimized Java algorithm.

Algorithm of Anagram Finder
1. word := input word whose anagrams are to be found
2. sortedWord := sort the characters of the given word
3. while not end of file do
       currentWord := next word from file
       sortedCurrentWord := sort the characters of currentWord
       if sortedWord == sortedCurrentWord then
           print "anagram of word is currentWord"
       end if
   end while

Algorithm of Maximum Finder
1. max := integer.min
2. while not end of file do
       current := next number from file
       if current > max then
           max := current
       end if
   end while
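The paper gives only pseudocode for the two programs. As a concrete reference point, a minimal sequential Java sketch of both algorithms is given below; the file-reading and word-splitting details are our own assumptions and are not taken from the original programs.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.Scanner;

public class SequentialFinders {

    // Anagram Finder: print every word in the file whose sorted letters
    // match the sorted letters of the input word.
    static void findAnagrams(String word, String file) throws IOException {
        char[] target = word.toCharArray();
        Arrays.sort(target);
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {            // while not end of file
                for (String current : line.split("\\s+")) {
                    char[] letters = current.toCharArray();
                    Arrays.sort(letters);
                    if (Arrays.equals(letters, target)) {
                        System.out.println("Anagram of " + word + " is " + current);
                    }
                }
            }
        }
    }

    // Maximum Finder: scan the file once and keep the largest integer seen so far.
    static int findMaximum(String file) throws IOException {
        int max = Integer.MIN_VALUE;                                 // max := integer.min
        try (Scanner scanner = new Scanner(new FileReader(file))) {
            while (scanner.hasNextInt()) {                           // while not end of file
                int current = scanner.nextInt();
                if (current > max) {
                    max = current;
                }
            }
        }
        return max;
    }
}

Both routines read the input sequentially, so their running time is bounded by the disk transfer rate, which is the behaviour the Java measurements in the next section reflect.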
3. RESULTS AND DISCUSSIONS

Comparing Hadoop and Java on the Anagram program

a.) In terms of Processing (1-node Hadoop cluster)

Table 1: Comparison of Hadoop and Java with the Anagram program.

Data size (MB)   Time taken by Hadoop (ms)   Time taken by Java (ms)
60               59914                       78047
125              131828                      158798
250              263656                      332188
500              447313                      624378

The above table shows the results of the anagram program run on different data sizes (in megabytes) on Hadoop and Java, with the processing time measured in milliseconds. The inference drawn is that the processing time taken by Hadoop is less than that of Java.
Graph 1. Comparison of Hadoop and Java on the Anagram program (y-axis: time in milliseconds, x-axis: data size in megabytes)

In the above graph, the y-axis represents the time in milliseconds for processing the anagram program on Hadoop and Java, and the x-axis shows the data size of the file in megabytes. From the graph it is clear that the time taken by Hadoop is less than that taken by Java. As the sample data size increases, the difference between the processing times of Hadoop and Java also increases, because Hadoop handles large data sizes better, while the processing speed of Java degrades as the data size grows.

Comparing Hadoop and the Hadoop Optimized Algorithm

In terms of Processing (1-node cluster)

Table 2. Comparison of Hadoop and Hadoop Optimized with the Anagram program
Data size (MB)   Time taken by Hadoop (ms)   Time taken by Hadoop Optimized (ms)
60               59914                       41764
125              131828                      73327
250              263656                      146655
500              447313                      253309

In this table, the anagram program was run on different data sizes (in megabytes) on Hadoop and Hadoop Optimized, and the time taken to process each file size is shown in milliseconds. We conclude that the processing time taken by Hadoop Optimized is less than that of Hadoop.

Graph 2. Comparison of Hadoop and Hadoop Optimized with the Anagram program (y-axis: time in milliseconds, x-axis: data size in megabytes)
In this graph, the y-axis shows the time in milliseconds for processing the anagram program on Hadoop and Hadoop Optimized, and the x-axis shows the data size of the file in megabytes. As shown in the graph, the time taken by Hadoop Optimized is less than that of Hadoop. As the data size increases, the gap between the bars for Hadoop and Hadoop Optimized also increases, because Hadoop Optimized handles large data sizes better than plain Hadoop.

Comparing the Hadoop Optimized Algorithm and the Java Algorithm

In terms of Processing

Table 3. Comparison of Java and Hadoop Optimized with the Anagram program

Data size (MB)   Time taken by Hadoop Optimized (ms)   Time taken by Java (ms)
60               41764                                 78047
125              73327                                 158798
250              146655                                332188
500              253309                                624378

In this table, the anagram program was run on different data sizes (in megabytes) on Hadoop Optimized and Java, and the time taken to process each file size is shown in milliseconds. We conclude that the processing time taken by Hadoop Optimized is less than that of Java.
Graph 3. Comparison of Java and Hadoop Optimized with the Anagram program (y-axis: time in milliseconds, x-axis: data size in megabytes)

In the above graph, the y-axis shows the time in milliseconds for processing the anagram program on Java and Hadoop Optimized, and the x-axis shows the data size of the file in megabytes. As shown in the graph, the time taken by Hadoop Optimized is less than that of Java. As the data size increases, the gap between the bars for Java and Hadoop Optimized also increases, because Hadoop Optimized handles large data sizes better than Java.

b.) In terms of Efficiency

Hadoop is more efficient than Java in terms of processing for the following reasons:
i) Parallel processing rather than sequential processing
ii) Movement of code rather than movement of data

Parallel Processing: Hadoop uses MapReduce programming, which processes the algorithm in parallel rather than sequentially as Java does (see the sketch after this section).

Movement of code rather than data: In Java, data is moved to the code for processing, but in Hadoop the code is moved to the data, thanks to the concept of data locality.
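The original Map/Reduce source is not included in the paper. The following is a minimal sketch, using the standard org.apache.hadoop.mapreduce API, of how the anagram search could be expressed so that Hadoop can run the scan in parallel over HDFS blocks; the class names, the "anagram.target" configuration key and the tokenisation are our own assumptions, not the authors' implementation.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramJob {

    // Mapper: each map task scans one input split (ideally a locally stored HDFS block)
    // and emits every word whose sorted letters match the sorted search word.
    public static class AnagramMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String sortedTarget;

        @Override
        protected void setup(Context context) {
            // The search word is assumed to be passed in through the job Configuration.
            char[] letters = context.getConfiguration().get("anagram.target").toCharArray();
            Arrays.sort(letters);
            sortedTarget = new String(letters);
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                char[] letters = word.toCharArray();
                Arrays.sort(letters);
                if (new String(letters).equals(sortedTarget)) {
                    context.write(new Text(sortedTarget), new Text(word));
                }
            }
        }
    }

    // Reducer: collects the matches found by all map tasks into a single output record.
    public static class AnagramReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> anagrams, Context context)
                throws IOException, InterruptedException {
            StringBuilder matches = new StringBuilder();
            for (Text word : anagrams) {
                if (matches.length() > 0) matches.append(", ");
                matches.append(word.toString());
            }
            context.write(key, new Text(matches.toString()));
        }
    }
}

Because the input file is split into blocks and the scheduler tries to place each map task on a node that already holds its block, the scan runs in parallel with little data movement, which is the data-locality behaviour described above.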
c.) In terms of Storage

Hadoop is more efficient than Java in terms of storage for the following reasons:
i) Fault tolerance
ii) Data recovery
iii) Back up

Fault Tolerance: Hadoop is designed to handle network faults, system crashes and so on in a very diligent way.

Data Recovery: Whenever a node in the cluster crashes, or there is some delay in its response due to network congestion, Hadoop tries to get the data or the response from another system in the cluster.

Back Up: Hadoop itself backs up your data. By default, a file saved onto Hadoop is divided into blocks of 64 MB, and each block is replicated 3 times onto different nodes in the cluster, depending upon the replication factor.

d.) In terms of Networking

Java algorithms can be designed to run on a network of computers, but they will always run slower than Hadoop for the following reason:
i) Data locality

Data Locality: In traditional (sequential/object-oriented) programming languages, it is the data that is always moved towards the code. So in a network of computers, the movement of data between machines can become a heavy task. In Hadoop, on the other hand, the job is always scheduled, where possible, on the node which stores the data, so there is far less movement of data. This concept is called data locality.

4. CONCLUSION

This study has shown that Hadoop gives better results than Java when the data size is large. But for a single node and small data sizes, Java performs better, because when working on a single node Hadoop cannot exploit parallelism, which increases its processing time. Our experimental results have shown a 40-50% improvement in execution time when using the optimized Map/Reduce algorithm, and a 15-30% improvement in execution time between the general Map/Reduce program and the optimized one. There are several threads running as background processes, and Map/Reduce programs are also executed by daemons. The JobTracker and TaskTracker processes, running on the system as well as
on the single-node cluster, use the concepts of networking. So, due to some internal network congestion and the priorities given to different threads, a small variance in the execution time has been observed.