Optimization and analysis of large scale data sorting algorithm based on Hadoop

Similar documents
Chapter 7. Using Hadoop Cluster and MapReduce

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Big Data With Hadoop

MapReduce. Tushar B. Kute,

CSE-E5430 Scalable Cloud Computing Lecture 2

Big Data and Apache Hadoop s MapReduce

Big Data and Scripting Systems beyond Hadoop

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

BSPCloud: A Hybrid Programming Library for Cloud Computing *


Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Big Data with Rough Set Using Map- Reduce

Parallel Processing of cluster by Map Reduce

Hadoop Scheduler w i t h Deadline Constraint

MapReduce (in the cloud)

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Evaluating partitioning of big graphs

MAPREDUCE Programming Model

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Data-Intensive Computing with Map-Reduce and Hadoop

Log Mining Based on Hadoop s Map and Reduce Technique

Survey on Scheduling Algorithm in MapReduce Framework

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Data Processing with Google s MapReduce. Alexandru Costan

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Task Scheduling in Hadoop

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Developing MapReduce Programs

Graph Processing and Social Networks

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

From GWS to MapReduce: Google s Cloud Technology in the Early Days

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

A Performance Analysis of Distributed Indexing using Terrier

GraySort on Apache Spark by Databricks

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

NoSQL and Hadoop Technologies On Oracle Cloud

Chase Wu New Jersey Ins0tute of Technology

Constructing a Data Lake: Hadoop and Oracle Database United!

Introduction to Hadoop

Distributed Apriori in Hadoop MapReduce Framework

16.1 MAPREDUCE. For personal use only, not for distribution. 333

L1: Introduction to Hadoop

Big Data Analytics Hadoop and Spark

MapReduce and Hadoop Distributed File System V I J A Y R A O

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Hadoop Architecture. Part 1

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Comparison of Different Implementation of Inverted Indexes in Hadoop

Fault Tolerance in Hadoop for Work Migration

Introduction to DISC and Hadoop

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Efficient Analysis of Big Data Using Map Reduce Framework

Integrating Big Data into the Computing Curricula

Architectures for massive data management

Introduction to MapReduce and Hadoop

Data and Algorithms of the Web: MapReduce

Distributed Framework for Data Mining As a Service on Private Cloud

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Map Reduce & Hadoop Recommended Text:

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Hadoop Design and k-means Clustering

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Introduction to Parallel Programming and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce


Hadoop and Map-reduce computing

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

A computational model for MapReduce job flow

Getting to know Apache Hadoop

MapReduce Evaluator: User Guide

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast

Hadoop Ecosystem B Y R A H I M A.

Hadoop IST 734 SS CHUNG

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

Performance and Energy Efficiency of. Hadoop deployment models

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Transcription:

Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo, tianlonglong, guodianjie, jiangxiaoming}@iie.ac.cn Abstract When dealing with massive sorting, we usually use Hadoop which is a framework that allows for the distributed processing of large sets across clusters of computers using simple programming models. A common approach in implement of big sorting is to use shuffle and sort phase in MapReduce based on Hadoop. However, if we use it directly, the efficiency could be very low and the load imbalance can be a big problem. In this paper we carry out an experimental study of an optimization and analysis of large scale sorting algorithm based on hadoop. In order to reach optimization, we use more than 2 rounds MapReduce. In the first round, we use a MapReduce to take sample randomly. Then we use another MapReduce to order the uniformly, according to the results of the first round. If the is also too big, it will turn back to the first round and keep on. The experiments show that, it is better to use the optimized algorithm than shuffle of MapReduce to sort large scale. 1 Introduction 1.1Hadoop The Apache Hadoop software library is a framework that allows for the distributed processing of large sets across clusters of computers using simple programming model [1]. It is designed to speed up from single servers to hundreds of or thousands of machines, each offering local storage and computation. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 1.2 MapReduce MapReduce is a programming model and an associated implementation for processing and generating large sets [2]. A map function can process a key/value pair to generate a set of intermediate key/value pairs, and a reduce function merges all intermediate values associated with the same intermediate key. This model can express many tasks in our real world. Programs written in this kind of functional style can be automatically parallelized and executed on a large cluster of commodity machines. The run-time system deals with the details of scheduling the program s execution across a set of machines and managing the required inter-machine communication, partitioning the input, handling machine failures. This let programmers without

any experience with distributed systems and parallel to easily use the resources of a large distributed system. MapReduce makes sure that the input to every reducer is sorted by key. The process by which the system executes the sort and transfers outputs of the map to the reducers as inputs is known as shuffle. In many ways, the shuffle is the heart of MapReduce. Shuffle is the process of transferring from the mappers to the reducers. Sorting in shuffle saves time for the reducer, helping it easily differentiate when a new reduce task should start. It begins a new reduce task simply, when the following key is different from the previous one, to put it easily. Figure 1-1 the process of MapReduce 1.3 Optimization of the sorting algorithm As the efficiency of shuffle may be low, we can achieve optimization based on Hadoop by some operating. In our work, we decide to use more than 2 rounds MapReduce. Because some nodes in cluster may take very heavy jobs while the others have basically no work to do. So we first to divide the to be sorted. In the first round, we use a MapReduce to take sample randomly. After it, we almost know the distribution of the. We can create some new files, every of which has average.then we use another MapReduce to order the uniformly, according to the results of the first round. If the is also too big, it will turn back to the first round to be divided and keep on. The experiments show that, it is better to use the optimized algorithm than shuffle of MapReduce to sort large scale. 2 Design and implementation 2.1 Basic idea of the algorithm

The basic idea is that the is divided into inter-block ordered, so that it can be sorted in the memory as much as possible. Figure 2-1 shows the general design of this algorithm. 1 2 3... n 1st MR Random Sample Loading Balance 1 Partition points 2 2nd Map Cluster Data 3 File_names... 2nd Reduce n Data Processing Big_File_names resu lt1 resu lt2... Figure2-1 the process of this algorithm resu ltm

I division and sort based on files (1)Get the distribution of to be sorted, and set the division site. First obtain the distribution information of the files. Take three sites of to make sample for each file to be sorted, and sample 4KB for each site, using the style like <, the number of > to save it in HashMap. The in map will be sorted by priority queue after statistics. Set division site, where the total number of sampling is counted that is named samplecount; obtained the total length of all files to be sorted by the meta information of files and the length is named totallength; set the size of to be sorted on every reduce manually, and the size is named blocksize; and set the length of divided segment is dividenums, so we can know that totallength divided by samplecount equals blocksize divided by dividenums,dividenums equals samplecount multiplied by blocksize divided by totallength, combining with priority queuing we can be obtained classified site. Set the number of reduce by adding the number of classified site and 1, and we can set the threshold value as a reference, whichever is the lowest value will be the number of reduce. (2) For each process of the Map, we can obtain the information of classified site through the process of setup; initialize the middle file, and generate n (the number of classified site + 1) files; Generate files using a random number, if there is a conflict, then re-generate the file name. When processing each, is stored in different files according to classified site; Wait for the end of writing files after the processing; take the dividing segment of each file as a key, the file path and file name as a value; Send to the reduce according to the partition function. (3) Use the number of key module reduce to send out map into reduce in partition process. (4) After obtaining <key, value> in reduce, we know what the corresponding divided segment is and the corresponding segments are; Obtain the meta information of these files, the total length of the statistics, if it is greater than the threshold value in RAM, then write the serial number of corresponding in output, and return with doing nothing; Otherwise, the contents of each file is read and placed in the priority queue for sorting in memory, write the result into a file named the number of segment, and put in the result directory, such as /result/0.that is to say, if the output directory is /result/, then corresponding number of segment is 0. II divide and sort again based on files (1)Read the contents of the output files and save all of the unsorted segment numbers. (2)Take the corresponding the middle file as a target for each unsorted segment number; Repeat the procedure1 again; Generate file like /result/1_2, and 1 is the first unsorted bit segment, 2 is the bit segment corresponding sort in process 2. III division and sort based on sets. For the second unsorted bit segment, send the in different interval to different reduce in accordance with different partition method after generating the division by classifying site in process 1; Write the results into directory after automatically sorting in shuffle, which corresponds

to the format of the form /result/1_2_3, 1 and 2 are the first two unsorted corresponding bit segment, 3 is the sorted bit segment. 2.2 concrete implementation of the algorithm The concrete processes executes as follows. ⑴ Take sample simply and divide the sets. We can suppose that the size of whole sets is 100M and the threshold of the division is 20M. We can get that the count of the sample is 100 and the number of divisions equals 20. So the situation of division is showed in Figure 2-2. And the number of reducer is 6. Figure 2-2 take sample and divide the set ⑵ Implement the setup function in the Mapper and the intermediate results are saved in the directory middle. The different filenames corresponding to its scope of. For example, the number 23 corresponding to the file Middle/0/_0.23568 and the is larger than 23 and smaller than 42 corresponding to the file Middle/1/_0.54668 and so on. This step is showed in Figure 2-3.

Figure 2-3 the filenames corresponding the scope of ⑶ Figure 2-4 shows the process that map generate the different division file. If map2 and map1 generate the same filename, we need to generate filenames again till there is not conflict in filenames. Figure 2-4 the process that map generate filenames

⑷ Figure 2-5 shows the content of different file generated in step 3. Figure 2-5 the content of the intermediate files ⑸ Send the number of scope and the corresponding directory to the reducer. We implement this by a function that the number modules the number of reducer. ⑹ Figure 2-6 and Figure 2-7 show that the size of file is smaller than 20M and write the files into the directory. Otherwise, return the directly. Figure 2-6 size of file is smaller than 20M

Figure 2-7 size of file is larger than 20M ⑺ The process ends when there isn t number of scope outputting. Otherwise, repeat from step 1 to step 6. ⑻ The final result can be seen in Figure 2-8. The number of MapReduce process depends on the precision which the sample represent the whole sets. Figure 2-8 the final result

3 Results and analysis We have implemented this algorithm with Java. We have already run this algorithm on the pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. The program sorts several sets with different size and distribution. Then we sort the same sets with the process of shuffle. The Table 3-1 shows the result of comparison. size baseline(s) new_partition(s) 30M 41 42 40M 45 45 50M 62 50 60M 58 52 70M 57 58 80M 62 56 90M 74 60 100M 77 65 110M 81 76 120M 88 80 130M 96 90 140M 97 88 150M 112 100 180M --- 140 Table 3-1 the running time of two processes Table 3-1 shows that out algorithm performs better than sort process of shuffle. We generate the testing randomly. And these situations can represent the distribution of normal sets. We can also find that the program executes the process 2 barely. It means that the sample is typical for the whole sets. The sort process of shuffle cannot work well when the size of input is larger than 180M. The pseudo-distributed mode we used doesn t possess enough memory to process surplus. However, the algorithm we design can work well because this process avoid to utilize the shuffle and try to sort in the main memory. We cannot test larger sets due to the memory limit of the memory. But the algorithm we design indeed performs better than the sort process of shuffle.

4 Conclusion and future work The contribution of this paper is a new strategy which is suitable for large-scale sorting and load-balancing based on Hadoop. Considering the load-imbalance leading by the assigning jobs of MapReduce, we take samples simply of the sets and get the information about its distribution basically. We partition the original sets averagely to make the work of each Reduce jobs approximately equivalent. In the process of MapReduce, there is a process named shuffle between Mapper and Reducer and shuffle can sort the output of Mapper by the key. The key-value pairs are written into the local file system disks, and sent to the same Reducer if the output has the same key. Therefore, shuffle can make the sets sorted. By comparing the efficiency that the shuffle process sort the sets, we can find that this algorithm performs better than the sort processing of shuffle in some occasions. We will perform this algorithm on full-distributed model consist of clusters and test large scale sets to verify the correctness of this algorithm. We are about to improve this algorithm continuously. We have already learned other paralleling models such as Spark, Pregel, and GraphLite. Spark is an architecture for fast and general processing on large clusters [3]. Pregel and GraphLite are systems built for large scale graph processing [4]. These models perform well on the situations that exists many iterative steps. We will modify this algorithm to adapt to these models.

References [1] Apahce Hadoop. https://hadoop.apache.org/ [2] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. [3] Matei Zaharia, An Architecture for Fast and General Data Processing on Large Clusters. [4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn,N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD 10, pages 135 146, 2010.