ANALYSING THE FEATURES OF JAVA AND MAP/REDUCE ON HADOOP




Livjeet Kaur, Research Student, Department of Computer Science, Punjabi University, Patiala, India

Abstract

In the present study we compare and analyse the features of Java and Map/Reduce on Hadoop. Two algorithms, Maximum Finder and Anagram Finder, were used as the basis of the study. The programs were run on the same data sizes to analyse the performance improvement on criteria such as storage, processing, networking and efficiency. The evaluation shows an improvement in execution time when using an optimized Map/Reduce algorithm, both over plain Java and over the general Map/Reduce program.

1. INTRODUCTION

A great many applications generate huge amounts of data daily. Many companies collect data in multiple forms, and a wealth of data accumulates with every passing day. Digging the useful information out of this ocean of data could greatly benefit many stakeholders, and Hadoop was made to handle exactly this problem. Hadoop is the Apache Software Foundation's open-source, Java-based implementation of the Map/Reduce framework: it provides the tools for processing vast amounts of data using Map/Reduce and, additionally, implements the Hadoop Distributed File System (HDFS).

Map/Reduce is Google's programming framework for performing parallel computations on large datasets, for example web request logs and crawled documents. The huge amounts of data have to be distributed over a number of machines (nodes). Each participating node contributes storage equally, and the nodes are connected under the same distributed file system. Each machine then performs the same computation over its locally stored share of the data, which yields parallel, large-scale distributed processing.

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. To illustrate with an example, take the oil and gas industry. Service companies such as Schlumberger, BHI and Halliburton perform drilling for this industry. They collect drilling data by placing sensors on drill bits and platforms, and then make the data available on their servers. The real-time drilling data indicates the drilling status, and applying reasoning algorithms to the historical data can give the operators useful information. But as time goes by, the data accumulates to enormity; once a data set reaches several gigabytes, reasoning over it becomes very time-consuming or practically impossible. A Map/Reduce system is a valuable framework in these kinds of situations.

The motivation for this study is that data is growing at an exponential rate, while the speed at which it can be read from or written to disk is increasing only gradually. In 1990 a typical drive stored 1370 MB and transferred about 4.4 MB/s, so reading the whole disk took about five minutes (1370 MB / 4.4 MB/s ≈ 310 s). Now a 1 TB disk with a transfer rate of about 100 MB/s is the norm, so reading all the data off the disk takes about 10,000 s (1,000,000 MB / 100 MB/s), i.e. well over two hours.

In the present work we analyse the features of Java and Map/Reduce on the Hadoop platform. Programming languages like Java read data sequentially, so their processing speed depends on the disk transfer rate; Map/Reduce systems, by contrast, operate on a distributed file system that supports parallel processing. We therefore measured the performance of two algorithms, Maximum Finder and Anagram Finder. These programs were run on the same data sizes in Java and in Map/Reduce, and the performance improvement was analysed on criteria such as storage, processing, networking and efficiency.

2. METHODOLOGY

Because of hardware limitations, the data we processed was in megabytes; data in terabytes, petabytes or exabytes was not feasible, since our experimental node has only 8 GB of RAM and a 1 TB hard disk, driven by an AMD processor. We installed VMware Player on the node, ran the CentOS operating system in a virtual machine, and installed Hadoop on it. The networking concepts were analysed on the local host. We used the two frameworks, Map/Reduce and Java, on the Hadoop platform to compare their performance. Hadoop ran in pseudo-distributed mode (a configuration sketch follows the scenario list), and this single-node Hadoop cluster was tested under two scenarios:

1. Time constraint
2. Data constraint
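The paper gives no details of the Hadoop configuration used. For a Hadoop 1.x pseudo-distributed setup, which matches the JobTracker/TaskTracker daemons mentioned in the conclusion, the usual single-node settings look like the following. They normally live in core-site.xml, hdfs-site.xml and mapred-site.xml; they are shown here as the equivalent programmatic calls, and the host and port values are assumptions, not values from the paper.

    import org.apache.hadoop.conf.Configuration;

    // Typical single-node (pseudo-distributed) Hadoop 1.x settings, expressed
    // as programmatic Configuration calls instead of the usual XML files.
    public class PseudoDistributedConf {
        public static Configuration make() {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:9000"); // NameNode on the local host
            conf.set("mapred.job.tracker", "localhost:9001");     // JobTracker on the local host
            conf.set("dfs.replication", "1");                     // one copy per block: single node
            return conf;
        }
    }

With dfs.replication set to 1, HDFS keeps a single copy of each block, which is the sensible choice when there is only one node to replicate to.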

For the first scenario, an optimized Java algorithm and worst-case and best-case Map/Reduce algorithms were written, and a comparative study was performed to check how much faster Map/Reduce is than the Java program. For the second scenario, the same algorithms were tested with different amounts of data to measure the execution time taken by Java and by Map/Reduce, which reflected the CPU utilization. The results from both scenarios were then analysed to evaluate the processing power of the Map/Reduce best and worst cases with respect to the optimized Java algorithm.

Algorithm of Anagram Finder

    1. word := input word whose anagrams are to be found
    2. sortedword := sort the characters of word
    3. while (!eof) do
           currentword := next word from file
           sortedcurrentword := sort the characters of currentword
           if (sortedword == sortedcurrentword)
               print("anagram of word is", currentword)
           end if
       end while

Algorithm of Maximum Finder

    1. max := integer.min
    2. while (!eof) do
           current := next number from file
           if (current > max)
               max := current
           end if
       end while
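The pseudocode above is written for a sequential scan, and the paper does not print its Map/Reduce source. A natural Map/Reduce formulation of the Anagram Finder keys each word by its sorted letters, so that the shuffle phase groups anagrams together; the sketch below follows that assumption (the class names are ours, not the paper's, and each class goes in its own file).

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map phase: emit (sorted letters, word) for every word in the input split.
    // Anagrams share the same sorted-letter key, so the shuffle groups them.
    public class AnagramMapper extends Mapper<Object, Text, Text, Text> {
        private final Text sortedKey = new Text();
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                char[] letters = token.toLowerCase().toCharArray();
                Arrays.sort(letters);              // canonical form, e.g. "listen" -> "eilnst"
                sortedKey.set(new String(letters));
                word.set(token);
                ctx.write(sortedKey, word);
            }
        }
    }

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce phase: all words with the same sorted-letter key arrive together;
    // join them into one comma-separated anagram group.
    public class AnagramReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder group = new StringBuilder();
            for (Text v : values) {
                if (group.length() > 0) group.append(", ");
                group.append(v.toString());
            }
            ctx.write(key, new Text(group.toString()));   // e.g. eilnst -> listen, silent
        }
    }

The Maximum Finder pseudocode, by contrast, translates almost line for line into the sequential Java program the study benchmarks against; a minimal sketch, assuming one integer per line of input:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Sequential Maximum Finder: a single pass over the file, so its running
    // time is bounded by the disk transfer rate discussed in the introduction.
    public class MaxFinder {
        public static void main(String[] args) throws IOException {
            int max = Integer.MIN_VALUE;                      // max := integer.min
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {      // while (!eof)
                    line = line.trim();
                    if (line.isEmpty()) continue;
                    int current = Integer.parseInt(line);
                    if (current > max) max = current;         // if (current > max) max := current
                }
            }
            System.out.println("Maximum = " + max);
        }
    }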

3. RESULTS AND DISCUSSION

Comparing Hadoop and Java on the anagram program

a.) In terms of processing (1-node Hadoop cluster)

Table 1: Comparison of Hadoop and Java with the anagram program.

    Data size (MB)    Hadoop (ms)    Java (ms)
    60                59914          78047
    125               131828         158798
    250               263656         332188
    500               447313         624378

The table shows the results of the anagram program run on files of different sizes (in megabytes) on Hadoop and on Java, with the processing time given in milliseconds. The inference is that Hadoop's processing time is consistently lower than Java's.

Graph 1: Comparison of Hadoop and Java on the anagram program (y-axis: time in milliseconds; x-axis: data size in megabytes).

The graph confirms that Hadoop's processing time is lower than Java's. As the sample data size increases, the gap between the two widens: Hadoop gets relatively faster on large data sizes, while Java's processing speed falls off as the data size grows.

Comparing Hadoop and the Hadoop optimized algorithm

In terms of processing (1-node cluster)

Table 2: Comparison of Hadoop and Hadoop Optimized with the anagram program.

    Data size (MB)    Hadoop (ms)    Hadoop Optimized (ms)
    60                59914          41764
    125               131828         73327
    250               263656         146655
    500               447313         253309

Here the anagram program was run on files of different sizes (in megabytes) on Hadoop and on Hadoop Optimized, with the processing time for each file size given in milliseconds. The processing time of Hadoop Optimized is lower than that of plain Hadoop across all sizes.

Graph 2: Comparison of Hadoop and Hadoop Optimized with the anagram program (y-axis: time in milliseconds; x-axis: data size in megabytes).

As the graph shows, Hadoop Optimized takes less processing time than Hadoop, and the gap between their bars widens as the data size grows, because the optimized version scales better on large data sizes.

Comparing the Hadoop optimized algorithm and the Java algorithm

In terms of processing

Table 3: Comparison of Java and Hadoop Optimized with the anagram program.

    Data size (MB)    Hadoop Optimized (ms)    Java (ms)
    60                41764                    78047
    125               73327                    158798
    250               146655                   332188
    500               253309                   624378

Here the anagram program was run on files of different sizes (in megabytes) on Hadoop Optimized and on Java, with the processing time for each file size given in milliseconds. Hadoop Optimized takes less processing time than Java across all sizes.
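The paper never states what distinguishes "Hadoop Optimized" from plain Hadoop. One common Map/Reduce optimization that can produce gains of this magnitude is a combiner, which pre-reduces map output locally before it crosses the network. The idea is easiest to show on the Maximum Finder, whose reduce function is associative and can therefore double as the combiner; the sketch below illustrates the technique and is not a reconstruction of the author's code.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxFinderMR {

        // Mapper: one integer per line; everything is emitted under a single
        // shared key so that one reducer sees all candidate maxima.
        public static class MaxMapper
                extends Mapper<LongWritable, Text, NullWritable, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String s = line.toString().trim();
                if (!s.isEmpty()) {
                    ctx.write(NullWritable.get(), new IntWritable(Integer.parseInt(s)));
                }
            }
        }

        // Reducer: keeps only the largest value. Because max is associative and
        // commutative, this same class can be registered as a combiner, so each
        // map task pre-reduces its own output and only one number per task
        // crosses the network -- a typical Map/Reduce optimization.
        public static class MaxReducer
                extends Reducer<NullWritable, IntWritable, NullWritable, IntWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<IntWritable> values,
                                  Context ctx) throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable v : values) {
                    max = Math.max(max, v.get());
                }
                ctx.write(key, new IntWritable(max));
            }
        }
    }

Registering the combiner is a one-line change in the driver, job.setCombinerClass(MaxFinderMR.MaxReducer.class); whether this matches the paper's actual optimization cannot be determined from the text.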

Graph 3: Comparison of Java and Hadoop Optimized with the anagram program (y-axis: time in milliseconds; x-axis: data size in megabytes).

As the graph shows, Hadoop Optimized takes less processing time than Java, and the gap between their bars widens as the data size grows, because Hadoop Optimized handles large data sizes better than Java.

b.) In terms of efficiency

Hadoop is more efficient than Java in terms of processing for the following reasons:

i) parallel rather than sequential processing
ii) movement of code rather than movement of data

Parallel processing: Hadoop uses MapReduce programming, which processes the algorithm in parallel, whereas Java processes it sequentially.

Movement of code rather than data: in Java, the data is moved to the code for processing; in Hadoop, the code is moved to the data, owing to the concept of data locality.
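To make the data-locality point concrete: when a job is submitted, Hadoop ships the application jar to the nodes that already hold the input blocks rather than pulling the blocks toward the code. A minimal driver for the anagram classes sketched in Section 2 (Hadoop 1.x-era API, consistent with the JobTracker/TaskTracker daemons the paper mentions; the class names are the assumed ones from that sketch):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AnagramDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "anagram finder");
            job.setJarByClass(AnagramDriver.class);   // this jar is shipped to the nodes holding the data
            job.setMapperClass(AnagramMapper.class);
            job.setReducerClass(AnagramReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input already resident in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results are written back to HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }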

c.) In terms of storage

Hadoop is more efficient than Java in terms of storage for the following reasons:

i) fault tolerance
ii) data recovery
iii) backup

Fault tolerance: Hadoop is designed to handle network faults, system crashes and the like in a very diligent way.

Data recovery: whenever a node in the cluster crashes, or its response is delayed by network congestion, Hadoop tries to get the data or the response from another system in the cluster.

Backup: Hadoop backs your data up by itself. By default, a file saved onto Hadoop is divided into blocks of 64 MB, and each block is replicated three times onto different nodes in the cluster, according to the replication factor.

d.) In terms of networking

Java algorithms can be designed to run on a network of computers, but they will always run slower than Hadoop for one reason:

i) data locality

Data locality: in traditional (sequential or object-oriented) programming languages, the data is always moved toward the code, so on a network of computers moving data between machines can become a heavy task. In Hadoop, by contrast, the job is always scheduled, where possible, on the node that stores its data, so far less moves across the network. This concept is called data locality.

4. CONCLUSION

The study has shown that Hadoop gives better results than Java when the data size is large, but for a single node and small data sizes Java performs better, because on a single node Hadoop cannot parallelize, which lengthens its processing time. Our experimental results showed a 40-50% improvement in execution time when using the optimized Map/Reduce algorithm, and a 15-30% improvement for the general Map/Reduce program, relative to Java. Several threads run as background processes, and Map/Reduce programs are also executed by daemons; the JobTracker and TaskTracker processes running on the system, and on the single-node cluster, use networking concepts. So, owing to some internal network congestion and the priorities given to different threads, a small variance in the execution time was observed.
