A review on MapReduce and addressable Big data problems

International Journal of Research In Science & Engineering
Special Issue: Techno-Xtreme 16
e-ISSN:    p-ISSN:

Prof. Aihtesham N. Kazi (1), Dr. S.Y. Amdani (2), Prof. Atul B. Kathole (3)
1 Assistant Professor, J.D.I.E.T, Yavatmal
2 Associate Professor, B.N.C.O.E, Pusad
3 Assistant Professor, J.D.I.E.T, Yavatmal

ABSTRACT

We are constantly told that we live in the Information Era, the age of Big Data. Organizations clearly need data-driven decision making to gain competitive advantage. Processing, integrating and interacting with more data should make it better data, providing both more panoramic and more granular views to aid strategic decision making. Big Data makes this possible by exploiting affordable and usable computational and storage resources. Big Data has become a catchphrase for data so large that it is not amenable to processing or analysis using traditional database and software techniques; such data is noted for its volume, its variety of data types, and its rapid accumulation. IBM estimates that 2.5 quintillion bytes of data are created daily, and that 90% of the data in use in the world today was generated in the past couple of years [1].

Keywords: Big Data, Hadoop architecture, NoSQL, Map-Reduce algorithm, HDFS, job scheduling

1. INTRODUCTION

Big Data, that is, large-scale, diverse, and high-resolution data sets [2], is found in business, science, engineering, healthcare, critical infrastructure management, and a variety of other domains. IT budgets are already under increased pressure. If organizations cannot contain their storage spending, they will have little alternative but to transfer resources from other IT projects. Research shows that 47 percent of IT budgets are assigned to maintaining IT infrastructure (including storage hardware and networking), 40 percent to information and transaction processing, and 13 percent to strategic IT investments.
Corporations see big data as a tool for commercial advantage, particularly in consumer marketing; more recently, the big data idea has been taken up as a mantra by government agencies. In the Big Data era, computation and storage are cheap per TB. With ever-growing computational capabilities, system utilization is therefore no longer as critical a factor: it is now feasible to use more computational power to do the same work. At the same time, the amount of data that needs processing has been increasing exponentially over the past decade as a result of improvements in data generation and storage capacity [1].

2. LITERATURE REVIEW

Firat Tekiner and John A. Keane observe that NoSQL approaches have multiplied in order to overcome weaknesses in storing and managing a variety of data. Despite their relatively recent emergence, there are now more than one hundred NoSQL approaches that specialize in managing different multimodal data types (from structured to unstructured), each aiming to solve very specific challenges. Most are powered by the Map-Reduce paradigm that came from Google, which is based on a massively distributed architecture that exploits cheap commodity hardware. [3] Big Data is perceived as the new driver of competitive advantage. Big Data applications have high Volume, high Variety and high Velocity, as findings are expected to be delivered very quickly. In reality, Big Data is largely driven by the need to analyze massive volumes of data to gain competitive advantage and to use previously intractable processes to find information relationships.
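As a rough illustration of the Map-Reduce paradigm mentioned above, the following single-process Python sketch counts words by applying a map function to each input record, grouping the intermediate pairs by key, and reducing each group. The names `map_fn`, `reduce_fn` and `run_mapreduce` and the sample records are hypothetical and are not part of Hadoop or of any system cited here; a real deployment would distribute the map and reduce calls across the nodes of a cluster.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map step: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: combine all values that share the same intermediate key."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver (an assumption for illustration, not a Hadoop API)."""
    groups = defaultdict(list)
    # Map phase followed by a simple shuffle: group intermediate values by key.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct intermediate key.
    output = {}
    for out_key in sorted(groups):
        for final_key, final_value in reducer(out_key, groups[out_key]):
            output[final_key] = final_value
    return output


# Made-up input records of the form (line number, text).
records = [(0, "big data needs big clusters"), (1, "data drives decisions")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

Because each call to `map_fn` touches only one record and each call to `reduce_fn` touches only one key, the calls are independent of one another, which is what lets a distributed framework run them in parallel.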

The study also illustrates how Map-Reduce and Hadoop are used to process and store large datasets on commodity hardware. How the Map-Reduce tasks are applied depends on the application. Because each map and reduce process can run in parallel, both can be used to speed up processing. An important challenge is to bring together and map the relational database model with columnar stores, key-value stores and unstructured data. Finally, the study argues that Big Data is really the enterprise data processing environment for heterogeneous data and computational sources, used in a timely manner to gain competitive advantage. [5]

Spyros Blanas and Jignesh M. Patel note that the MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although many studies have examined join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins, and MapReduce programmers often use simple but inefficient algorithms to perform them. The authors describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. In fact, IBM is engaged with a number of enterprise customers to prototype novel Hadoop-based solutions on massive amounts of structured and unstructured data for their business analytics applications. [3]

Grolinger, K., Hayes, M., Higashino, W.A. et al. show that, in the Big Data community, MapReduce has been seen as one of the key enabling approaches for meeting the continuously increasing demands on computing resources imposed by massive data sets.
The reason for this is the high scalability of the MapReduce paradigm, which allows massively parallel and distributed execution over a large number of computing nodes. Their paper identifies MapReduce issues and challenges in handling Big Data, with the objective of providing an overview of the field, facilitating better planning and management of Big Data projects, and identifying opportunities for future research. The identified challenges are grouped into four main categories corresponding to Big Data task types: data storage (relational databases and NoSQL stores), Big Data analytics (machine learning and interactive analytics), online processing, and security and privacy. Current efforts aimed at improving and extending MapReduce to address the identified challenges are also presented. By identifying the issues and challenges MapReduce faces when handling Big Data, the study encourages future Big Data research. [4]

Katal, A., Wazid, M. and Goudar, R.H. define Big Data as a large amount of data that requires new technologies and architectures so that value can be extracted from it through capture and analysis. Because of the size of the data, it becomes very difficult to perform effective analysis using existing traditional techniques. Big Data poses many challenges owing to properties such as volume, velocity, variety, variability, value and complexity. Since Big Data is a recent technology that can bring huge benefits to business organizations, the challenges and issues associated with adopting and adapting to it need to be brought to light. Their paper introduces Big Data technology, its importance in the modern world, and existing projects that are effective and important in changing the concept of science into big science, and society too.
The various challenges and issues in adopting and accepting Big Data technology and its tools (Hadoop) are also discussed in detail, along with the problems Hadoop is facing. [5]

Alam, A. and Ahmed, J. describe how Hadoop is used for social media applications and discuss its shortcomings. Hadoop is a distributed paradigm used to manipulate large amounts of data. This manipulation covers not only storage but also processing of the data. Hadoop is normally used for data-intensive applications. It holds huge amounts of data and, on demand, performs operations such as data analysis, result analysis and data analytics. Nowadays almost every social media platform uses Hadoop for purposes such as opinion mining. [6]

Chen Jie, Chen Dongjie and Huang Bangming discuss the arrival of the next generation of big data. In recent years, owing to the growing popularity of human network behavior, a new era of big data has arrived, characterized by large data storage requirements and fast business growth. To address the large storage capacity, instability and slow retrieval caused by large volumes of data, their paper demonstrates a distributed storage and server clustering approach that tackles the limits of traditional storage capacity and the high availability of retrieval and servers. For the era of big data, it offers guidance for future construction on how to improve retrieval times, storage capacity and system stability. [7]

Sivaraman, E. and Manickachezian, R. state that Hadoop is a rapidly growing ecosystem of components, based on Google's MapReduce algorithm and file system work, for implementing MapReduce algorithms in a scalable fashion, distributed on commodity hardware. Hadoop enables users to store and process large volumes of data and analyse it in ways not previously possible with SQL-based approaches or less scalable solutions. Remarkable improvements in conventional compute and storage resources help make Hadoop clusters feasible for most organizations. Their paper begins with a discussion of Big Data evolution and the future of Big Data based on Gartner's Hype Cycle. It explains how the Hadoop Distributed File System (HDFS) works and describes its architecture with suitable illustrations. Hadoop's MapReduce paradigm for distributing a task across multiple nodes is discussed with sample data sets, as is the way MapReduce and HDFS work when put together. Finally, the paper ends with a discussion of sample Big Data Hadoop use cases, which show how enterprises can gain a competitive benefit by being early adopters of big data analytics. [8]

We live in an on-demand, on-command digital universe, with data proliferating from institutions, individuals and machines at a very high rate. This data is categorized as "Big Data" due to its sheer Volume, Variety and Velocity. Most of this data is unstructured, quasi-structured or semi-structured, and it is heterogeneous in nature. The volume and heterogeneity of the data, together with the speed at which it is generated, make it difficult for present computing infrastructure to manage Big Data. Traditional data management, warehousing and analysis systems lack the tools to analyze this data. Owing to its specific nature, Big Data is stored in distributed file system architectures. Hadoop and HDFS by Apache are widely used for storing and managing Big Data. Analyzing Big Data is a challenging task, as it involves large distributed file systems that should be fault tolerant, flexible and scalable. Map Reduce is widely used for the efficient analysis of Big Data. Traditional DBMS techniques such as joins and indexing, and other techniques such as graph search, are used for classification and clustering of Big Data.
These techniques are being adapted for use in Map Reduce. The paper summarized here suggests various methods for addressing the problems at hand through the Map Reduce framework over the Hadoop Distributed File System (HDFS). Map Reduce is described as a minimization technique that makes use of file indexing with mapping, sorting, shuffling and finally reducing. The Map Reduce techniques studied in that paper are implemented for Big Data analysis using HDFS. [9]

Hadoop is an open source cloud computing platform of the Apache Foundation that provides a software programming framework called MapReduce and a distributed file system, HDFS. It is a Linux-based set of tools that uses relatively inexpensive commodity hardware to handle, analyze and transform large quantities of data. HDFS stores huge data sets reliably and streams them to user applications at high bandwidth, and MapReduce is a framework for processing massive data sets in a distributed fashion over several machines. The paper gives a brief overview of Big Data, Hadoop MapReduce and the Hadoop Distributed File System along with its architecture. [10]

Hadoop_MapReduce is winning more and more attention for its open source distributed parallel computing technology, high efficiency and economy. The paper describes Hadoop_MapReduce and its related technologies, studies in detail the thinking behind the Hadoop_MapReduce algorithm and its system architecture, discusses the parallelization of the algorithm and its feasibility for internet search over massive amounts of data, and applies the MapReduce model on the Hadoop platform to build an inverted index in parallel. [11]

The K-Medoids clustering algorithm solves the problem that the K-Means algorithm has with outlier samples, but it is not able to process big data because of its time complexity [1].
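The outlier sensitivity mentioned above can be seen in a small sketch. It assumes one-dimensional, made-up samples and uses the absolute difference as the distance; the helper `medoid` is hypothetical and performs a naive all-pairs search, whose quadratic number of distance evaluations hints at why the plain algorithm struggles with large data sets.

```python
from statistics import mean


def medoid(points):
    """Return the sample with the smallest total distance to all samples.

    Naive search: for n samples this evaluates on the order of n * n distances.
    """
    return min(points, key=lambda candidate: sum(abs(candidate - p) for p in points))


# Made-up samples; 100.0 plays the role of an outlier.
samples = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = mean(samples)      # center as a K-Means style average
center_medoid = medoid(samples)  # center as a K-Medoids style representative

print(center_mean, center_medoid)
```

Here the mean is pulled to 22.0 by the single outlier, while the medoid stays at 3.0, which is one of the actual samples.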
MapReduce is a parallel programming model for processing big data, and has been implemented in Hadoop. To overcome the big-data limits, the parallel K-Medoids algorithm HK-Medoids, based on Hadoop, was proposed. Every submitted job consists of many iterative MapReduce procedures: in the map phase, each sample is assigned to the cluster whose center is most similar to the sample; in the combine phase, an intermediate center for each cluster is calculated; and in the reduce phase, the new center is calculated. The iteration stops when the new center is similar to the old one. The experimental results showed that the HK-Medoids algorithm gives a good clustering result and linear speedup for big data. [12]

Han Hu, Yonggang Wen, Tat-Seng Chua and Xuelong Li focus on scalable big-data systems, which include a set of tools and mechanisms to load, extract, and improve disparate data while leveraging massively parallel processing power to perform complex transformations and analysis. Owing to the uniqueness of big data, designing a scalable big-data system faces a series of technical challenges, including:
1) First, due to the variety of disparate data sources and the sheer volume, it is difficult to collect and integrate data scalably from distributed locations. For instance, more than 175 million tweets containing text, images, video and social relationships are generated by millions of accounts distributed globally.
2) Second, big data systems need to store and manage the gathered massive and heterogeneous datasets while providing functional and performance guarantees in terms of fast retrieval, scalability, and privacy protection. For example, Facebook needs to store, access, and analyze over 30 petabytes of user-generated data.

3) Third, big data analytics must effectively mine massive datasets at different levels in real time or near real time, including modeling, visualization, prediction, and optimization, so that inherent promises can be revealed to improve decision making and acquire further advantages. [13]

3. AIM OF SYSTEM

The definition of Big Data has four Vs as its problem statements: Volume, Velocity, Variety and Value [13]. The main aim of this work is to focus on velocity and variety as problems; here, velocity concerns the amount of data transferred and processed by MapReduce techniques in distributed cluster environments. The aim of this research work is therefore to design and implement job scheduling algorithms for MapReduce that minimize the overhead of fault tolerance and increase resource availability based on the locality of resources in a Hadoop cluster, thereby increasing the performance of Big Data processing applications. The two major performance metrics in MapReduce are job execution time and cluster throughput. Both can be seriously affected by straggler machines (machines on which tasks take an unusually long time to finish); the work focuses on offloading tasks from straggler machines so as to improve cluster throughput. One of MapReduce's most significant advantages is that it provides an abstraction that hides many system-level details from the programmer. It processes data by dividing the work into two phases: Map and Reduce. Figure 1 shows that the Map function is applied to each input key-value pair and generates an arbitrary number of intermediate key-value pairs. The Reduce function is applied to all values associated with the same intermediate key and generates output key-value pairs as the final result.

Figure 1. The MapReduce computing framework

4. OBJECTIVES

The proposed research work will try to achieve some or all of the following objectives:
1) Development of algorithm(s) to increase performance in big data processing by enhancing the job scheduling mechanism, taking into account the heterogeneous cluster environment of Hadoop.
2) Development of algorithm(s) to increase performance in big data processing by enhancing the fault tolerance mechanism in Hadoop.
3) Comparison of the results of various algorithms used for fault tolerance, in order to reduce the overheads of the scheduler of a MapReduce cluster in the Hadoop architecture.
4) Suggestion of a suitable combination of algorithms, for the velocity aspect of Big Data, to solve Big Data problems.

5. METHODOLOGY

The design will be based on the big data framework and the Hadoop architecture to address the various issues related to Big Data problems. The proposed research work will first study the impact of various categories of data, such as structured or unstructured, when processing huge amounts of data, and then adopt a proper combination of algorithms to minimize the burden on the processing infrastructure. Second, the large datasets generated by our primary proposed work will be evaluated by comparing them with the original datasets using big data tools.

6. SCOPE

The drastic increase in the volume of data puts a heavy burden on servers in terms of storage cost and performance. This provides scope to find feasible solutions, for example processing data by dynamically allocating jobs to nodes using different enhanced algorithms. This research aims to design and implement enhanced scheduling algorithms and to minimize the overhead of fault tolerance in the heterogeneous environment of a Hadoop cluster.

7. CONCLUSION

The proposed research work is expected to provide reasonable improvements in the performance of Hadoop algorithms and better analytical outcomes for big data applications.

ACKNOWLEDGEMENT

Prof. Aihtesham N. Kazi and Prof. Atul B. Kathole are assistant professors at Jawaharlal Darda Institute of Engineering and Technology, Yavatmal, MS, India. Dr. S.Y. Amdani is associate professor and Head at Babasaheb Naik College of Engineering, Pusad, MS, India.

REFERENCES

[1] Xindong Wu, Xingquan Zhu: Data Mining with Big Data, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, January 2014
[2] Shilpa and Manjit Kaur: BIG Data and Methodology - A review, International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October, pp.
Lijuan Zhou, Hui Wang, Wenbo Wang, Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment, TELKOMNIKA Indonesian Journal of Electrical Engineering, Vol. 10, No. 5, September 2012, pp. 1087-1092
[3] A Comparison of Join Algorithms for Log Processing in MapReduce.
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data
[4] Grolinger, K.; Hayes, M.; Higashino, W.A.; L'Heureux, A.; Allison, D.S.; Capretz, M.A.M., "Challenges for MapReduce in Big Data," Services (SERVICES), 2014 IEEE World Congress on, pp. 182-189, June July
[5] Firat Tekiner and John A. Keane, Big Data Framework, 2013 IEEE International Conference on Systems, Man, and Cybernetics
[6] Alam, A.; Ahmed, J., "Hadoop Architecture and Its Issues," Computational Science and Computational Intelligence (CSCI), 2014 IEEE International Conference on, vol. 2, pp. 288-291, March 2014

[7] Chen Jie; Chen Dongjie; Huang Bangming, "Research on big data information retrieval based on hadoop architecture," Electronics, Computer and Applications, 2014 IEEE Workshop on, pp. 492-495, 8-9 May 2014
[8] Sivaraman, E.; Manickachezian, R., "High Performance and Fault Tolerant Distributed File System for Big Data Storage and Processing Using Hadoop," Intelligent Computing Applications (ICICA), IEEE 2014 International Conference on, pp. 32-36, 6-7 March 2014
[9] Manikandan, S.G.; Ravi, S., "Big Data Analysis Using Apache Hadoop," IT Convergence and Security (ICITCS), 2014 International Conference on, pp. 1-4, Oct
[10] Singh, K.; Kaur, R., "Hadoop: Addressing challenges of Big Data," Advance Computing Conference (IACC), 2014 IEEE International, pp. 686-689, Feb
[11] AiLing Duan; HaiFang Si, "Research and Practice of Distributed Parallel Search Algorithm on Hadoop_MapReduce," Control Engineering and Communication Technology (ICCECT), 2012 International Conference on, pp. 105-108, 7-9 Dec
[12] Yaobin Jiang; Jiongmin Zhang, "Parallel K-Medoids clustering algorithm based on Hadoop," Software Engineering and Service Science (ICSESS), th IEEE International Conference on, pp. 649-652, June
[13] Han Hu, Yonggang Wen, Tat-Seng Chua and Xuelong Li, Toward Scalable Systems for Big Data Analytics: A Technology Tutorial, IEEE Access, date of publication June 24, 2014