TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERING VARIOUS PARAMETERS


TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERING VARIOUS PARAMETERS

Dr. D Rajya Lakshmi 1, Mr. R Praveen Kumar 2, Mr. N K Sumanth 3
1 Professor of CSE, JNTUK-UCEV, Vizianagaram, AP, (India)
2,3 Asst. Prof. (C), CSE, JNTUK-UCEV, Vizianagaram, AP, (India)

ABSTRACT

There is a lot of hype around Hadoop nowadays, which mistakenly suggests that Hadoop can do anything; it cannot. Hadoop is meant for certain kinds of applications and is not suitable for real-time applications. Still, enterprises adopt Hadoop for its extreme horsepower in processing huge data in little time. Hadoop is an open-source framework of tools that supports running applications on Big Data. The simple Hadoop architecture consists of two main components: the Hadoop Distributed File System (HDFS) and Map-Reduce. HDFS breaks a file into fixed-size blocks and spreads them over the cluster. Map-Reduce takes care of computation by moving the code, or pieces of code, to the blocks where they reside on a cluster of commodity hardware. Hadoop overcomes the limitations of a plain distributed file system by storing large data, offering failure protection, and providing fast access, but it is built for a specific class of applications. HDFS blocks are large: the default block size is 64 MB, compared to the 4 KB-8 KB blocks of conventional structured file systems. The major contribution of this work is an evaluation of Hadoop read/write performance with variable block sizes, where the default configuration yields low performance, together with a full treatment of the Map-Reduce concept and the distribution of jobs. Hadoop clusters normally run with a default block size, which is fixed (multiples of 64 MB), yet the read/write speed for such huge files varies from one node to another depending on the commodity hardware. We consider only one parameter here: disk speed. Traditional hard disks are far slower than the solid-state disks now available, so a node with a high-performance disk reads its share of the file and computes fast, while the others compute slowly. The Job Tracker must wait for all computations to finish before returning the final result to the application, and this causes a performance problem. By altering the block sizes, giving high-performance nodes a large block size and low-performance nodes a small block size, all nodes achieve roughly equal read/write performance, so the master node does not wait on the other nodes when collecting the computations, which leads to better performance. Other contributions include changing the Hadoop configuration files, mapper-reducer coordination (reducers need not wait until all mappers finish their tasks), the maximum number of tasks per task tracker, and networking in the rack architecture or core-switch configuration of a cluster; many more things can be done to improve the performance of the Hadoop framework. Note: disk speed is given only as one example parameter.

Keywords: Benchmarks, Combiner, HDFS, HDFS Block Size, Map Reduce.

I. INTRODUCTION

This is the era of Big Data [1], and storing and processing such huge amounts of data is the challenge. Big Data is simply defined as data beyond our storage capacity and processing power. Where does such data come from? Examples include Facebook data (500+ TB per day), flight sensor data (10 TB per half-hour journey), the NYSE (New York Stock Exchange), and files (structured, unstructured, and semi-structured). Big Data generally refers to unstructured or semi-structured data, where the data takes the form of plain logs. Big Data is commonly characterized along three dimensions (the 3 V's):

Volume is the most obvious of the three, referring to the size of the data. The massive volumes of data that we are currently dealing with have required scientists to rethink storage and processing paradigms in order to develop the tools needed to properly analyze them.

Velocity addresses the speed at which data can be received as well as analyzed. This covers the difference between batch processing, which works on historical data, and stream processing, which analyzes the data in real time as it is generated. It also refers to the rate of change of data, which is especially relevant to stream processing.

Variety refers to the issue of disparate and incompatible data formats. Data can come from many different sources and take many different forms, and just preparing it for analysis takes a significant amount of time and effort.

How do we store and process such huge data? The best solution is distributed storage with parallel processing, i.e., Hadoop. Apache Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity computers using a simple programming model. Hadoop is generally meant for Big Data. Some people think that Hadoop can do anything, but an open-source distributed processing framework is not always the right choice; one needs to evaluate carefully when to use Hadoop. Hadoop suits a huge amount of data in a single file rather than small amounts of data spread over multiple files, it is not suitable for real-time applications, and it is built for a specific class of applications: write once, read many times.

1.1 Hadoop's Approach

Hadoop [2] is a framework of tools used to support running applications on Big Data. The traditional approach distributes the data for storage, but the computation remains processor-bound: one powerful computer acts as the server (see Fig 1.1). Big data processed by this powerful enterprise computer results in various problems such as network constraints, latency, and cost. Hadoop's approach is similar, but the beauty here is that the computation is also broken up and the results are combined at the end (see Fig 1.2). No high-end computers are required; commodity hardware is enough to process big data. Data locality is preserved, and the computation is moved towards the data stored in the distributed environment.

Fig 1.1: Traditional Approach (Big Data processed by one powerful computer)

Fig 1.2: Hadoop's Approach (Big Data broken into pieces, computation on commodity hardware, results combined)

1.2 Hadoop Ecosystem

Hadoop has two major components, HDFS [3] and Map Reduce [4]. HDFS (Hadoop Distributed File System) is used to store big data in a distributed environment; Map Reduce is a programming model used to process the distributed data in parallel on a cluster. Hadoop follows a master-slave architecture over a cluster of commodity hardware (see Fig 1.4). We now define some Hadoop-related terminology:

1. HDFS Cluster: the name given to the whole configuration of master and slaves where the data is stored. We discuss the master and slave nodes in detail later.
2. Homogeneous Cluster: all the nodes have similar hardware configurations.
3. Heterogeneous Cluster: the nodes have different hardware configurations, as in a cloud.
4. Rack: all the nodes connected to a single LAN.
5. File System: on top of the underlying (UNIX) file system, Hadoop provides a distributed file system and an API, Map-Reduce, for processing all the stored data.

II. HDFS FEATURES

Highly fault tolerant: data blocks are replicated across data nodes. Since commodity hardware is used, the probability of failure is high, so replication is required for high availability.

Extremely low cost per byte: HDFS uses commodity direct-attached storage and shares the cost of the network and computers it runs on with the MapReduce layer of the Hadoop stack. HDFS is open-source software, so an organization can use it with zero licensing and support costs. This cost advantage lets organizations store and process orders of magnitude more data per dollar than traditional SAN or NAS systems.

Very high bandwidth: HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per computer into the MapReduce layer, on a very low cost shared network.

Fig 1.3: HDFS Cluster Setup (homogeneous/heterogeneous cluster, disk storage across Rack-1 through Rack-n)

2.1 Hadoop Architecture

Let us discuss the components of Hadoop in detail, together with how Hadoop stores data and processes map reduce jobs. The following are Hadoop's first-generation components.

Master
Name Node: manages and maintains the metadata.
Secondary Name Node: the backup node.
Job Tracker: breaks the bigger task into smaller pieces of computation.

Slaves
Data Node: manages the piece of data stored on its machine.
Task Tracker: processes the data on its data node.

These master and slave services are called daemons and run in the background. Every master service can talk to every other master service, and every slave service can talk to every other slave service. Similarly, the name node can talk to the data nodes and vice versa, and the Job Tracker can talk to the Task Trackers and vice versa. All of this communication happens through heartbeats.

Name Node: the master of the system. It maintains and manages the blocks present on the data nodes. The Name Node should not be commodity hardware.
Data Node: the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.
Client: an application used to interact with both the name node and the data nodes.
Job Tracker: sends the computations to the data nodes.
Task Tracker: processes the data using the map reduce programming model.
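As a minimal sketch of how a client interacts with the cluster, the following Java program (the path is hypothetical, not from the paper) reads a file from HDFS: the FileSystem object asks the Name Node for the block locations, then streams the data directly from the Data Nodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client: metadata comes from the Name Node,
// the actual bytes come from the Data Nodes.
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/input.txt"); // hypothetical path
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}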

Fig 1.4: Master-Slave Architecture (R = replication, DN = Data Node). The client sends requests to the Name Node and submits jobs to the Job Tracker; Data Nodes and Task Trackers run on the slaves in Rack-1 through Rack-4.

The client communicates with the Data Nodes and the Name Node through SSH (Secure Shell). The default number of replicas is 3; the user can change it through the dfs.replication configuration property. The Name Node stores the metadata in RAM; the actual data is stored on the Data Nodes.

2.2 HDFS Work Flow

Step 1: the user copies the files (log files, etc.) into the DFS.
Step 2: the user submits a job (some kind of analysis that needs to be done).

Fig 1.5: Job Tracker Work Flow (the client mediates between the user, the DFS with its input splits, maps, and reduces, and the Job Tracker with its job queue)

Step 3: the client gets information from the DFS about the file: where it is located, on which rack, in which data blocks, and so on (supplied by the Name Node).
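To make the dfs.replication setting concrete, here is a hedged Java sketch (the path and the values are illustrative, not from the paper) that writes a file to HDFS with an explicit replication factor and block size, using the FileSystem.create overload that takes both:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default; normally set in hdfs-site.xml.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Per-file override of buffer size, replication factor, block size.
        FSDataOutputStream out = fs.create(
                new Path("/user/demo/sample.txt"), // hypothetical path
                true,               // overwrite if it exists
                4096,               // I/O buffer size in bytes
                (short) 2,          // replication factor for this file
                64 * 1024 * 1024);  // 64 MB block size
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}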

Step 4: the bigger job is split into smaller tasks, one per input split; records are read from each split through an interface called RecordReader.
Step 5: the job information (job.xml, job.jar) is uploaded to the DFS.
Step 6: the job is placed in the job queue of the Job Tracker.
Steps 7 and 8: the job is initialized and the job files are read.
Step 9: the maps and reducers are created.

Map and Reduce programs can be written in a programming language (Java) or a scripting language (Python, Ruby). Map processes the data first and generates key-value pairs; the reducer then works on these and produces the output. Shuffle, sort, combiner, and partitioner are the intermediate phases that run before the result reaches the reducer. We discuss them in detail in the next section.

2.3 Map Reduce Work Flow

The file is divided into blocks of 64 MB by default. If the input file is 1 GB, HDFS provides 16 blocks of 64 MB each. These 16 blocks are the input splits, and the number of Map functions equals the number of input splits. Each Map program emits key-value pairs, which can be passed to a combiner. A combiner is a mini-reducer that performs shuffle and sort on the output of a single mapper.

Shuffle: brings all the related key-value pairs together.
Sort: puts the key-value pairs into sorted order.
Reducer: after shuffle and sort, the next phase is the reducer. There can be one or more reducers, and each reducer produces one output file, named part-00000, part-00001, and so on.

Fig 1.6: Map Reduce Work Flow
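The classic word-count program illustrates this flow. The sketch below follows the well-known example on the standard Hadoop Java API (Hadoop 2.x style); note that the same reduce logic is wired in as a combiner, so each mapper's output is pre-aggregated locally before the shuffle:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (offset, line) -> list of (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, count); also reused as the combiner
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // mini-reducer per mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}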

Partitioner: decides to which reducer each key-value pair coming out of the shuffle and sort phase is sent, so that pairs reach the intended reducer. Used well, this enhances the performance of map reduce jobs.

2.4 Logical View of the Map Reduce Flow

The Map and Reduce functions of MapReduce are both defined with respect to data structured as (key, value) pairs. Map takes one pair of data with a type in one data domain and returns a list of pairs in a different domain:

Map(k1, v1) → list(k2, v2)

The Map function is applied in parallel to every pair in the input dataset, producing a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.

III. PERFORMANCE ISSUES

There are several areas where research can improve the performance of Hadoop jobs by tuning various parameters such as block sizes, sorting algorithms, and data distribution algorithms. Some of them are identified here, with solutions aimed at benchmarks such as TeraSort and read/write sort:

1. Sort and shuffle optimization techniques to increase the speed-up factor of the map reduce engine.
2. An efficient combiner that reduces the network overhead.
3. An efficient way for the name node to manage and maintain the metadata.
4. Tuning the block sizes so that all nodes have equal read/write performance.
5. Better use of the partitioner, which enhances performance by sending large sets of key-value pairs to the reducer whose configuration is best in a heterogeneous network.
6. Self-learning capability for the nodes, so that a node failure is detected immediately and the job is rescheduled to the nearest node in the cluster, reducing the time reducers spend waiting.
7. A framework to discover the network topology of a cluster and troubleshoot it, e.g., to find which data node is working slowly.
8. Altering the parameters of the Hadoop configuration files.
9. The HDFS block placement policy [5].
10. A framework to reach the TeraSort [6] and read/write sort benchmarks.

IV. TUNING THE PARAMETERS TO INCREASE THE PERFORMANCE

1. Sort and shuffle optimization techniques are required to speed up the map reduce engine. Sorting algorithms should be designed to run in memory. External sorting algorithms waste network bandwidth by moving key-value pairs between RAM and the hard disk, or towards HDFS on other nodes or other racks in the cluster; the goal is to compute the sort in memory only.
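As a hedged illustration of points 1 and 5 above, here is a sketch of a custom partitioner; the class name and the routing rule are invented for illustration, and the point is only that related keys can be steered to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys sharing a first character are routed to
// the same reducer, so related key-value pairs land in one output file.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0; // empty keys all go to the first reducer
        }
        // Text.charAt returns the Unicode code point at the given offset
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be enabled on a job with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(n); when no partitioner is set, Hadoop falls back to a hash partitioner.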

2. Manage and maintain the metadata information on the name node: small files are a big problem in Hadoop. A small file is one significantly smaller than the HDFS block size (default 64 MB). If you are storing small files, you probably have lots of them (otherwise you would not turn to Hadoop), and the problem is that HDFS cannot handle lots of files. Furthermore, HDFS is not geared towards efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from data node to data node to retrieve each small file, which is an inefficient data access pattern. The metadata for all these files is very difficult to maintain. Since the metadata is stored in RAM, the name node, even though it is not commodity-class hardware, still struggles to store and maintain it. This is where we need to keep our eyes. Compare a 1 GB file broken into 16 blocks of 64 MB with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent job with a single input file. There are a couple of features that help reduce the bookkeeping overhead. Hadoop Archives (HAR files) were introduced to HDFS to alleviate the problem of lots of files putting pressure on the name node's memory. Study the different solutions, try to implement them, and work out your own solution.

3. Tuning the block sizes: HDFS blocks are large; the default block size is 64 MB, compared to the 4 KB-8 KB blocks of conventional structured file systems. Hadoop clusters can run at any stage with a default block size, which is fixed (multiples of 64 MB), yet the read/write speed for such huge files varies from one node to another depending on the commodity hardware. Again taking disk speed as the one parameter: traditional hard disks are far slower than solid-state disks, so a node with a high-performance disk reads its share of the file and computes fast while the others compute slowly. The Job Tracker must wait for all computations to finish before returning the final result to the application, and this causes a performance problem. By altering the block sizes, giving a high-performance node a large block size and a low-performance node a small block size, all nodes achieve roughly equal read/write performance, so the master node does not wait on the other nodes when collecting the computations, which leads to better performance.

4. Data placement policies: when a client requests blocks, the name node always hands out the free HDFS blocks nearest to it, so most requests are received by the nearest node every time. This reduces that node's performance, because it is always serving requests rather than doing computation. Data placement algorithms should be designed so that requests are sent to nodes that are not only nearby but also satisfy other metrics, for example that all the nodes in the cluster hold an equal number of HDFS blocks.
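A hedged sketch of the kind of parameter changes discussed in this section is given below. The property names follow Hadoop 1.x, where the Job Tracker and Task Tracker exist, and may differ in later versions; the values are illustrative, not recommendations from the paper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Larger HDFS block size for files written by this job (128 MB here).
        conf.setLong("dfs.block.size", 128 * 1024 * 1024L);

        // Bigger in-memory buffer for the map-side sort, in MB, so less
        // intermediate data spills to disk.
        conf.setInt("io.sort.mb", 256);

        // Let reducers start fetching map output once 80% of the maps are
        // done, instead of waiting for all mappers to finish.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);

        // Maximum concurrent map tasks per task tracker.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);

        Job job = Job.getInstance(conf, "tuned job");
        // ... set mapper/reducer classes and input/output paths as usual ...
    }
}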

V. CONCLUSION

As we live in a world of digital data, one should know how to store and process such huge amounts of data. We defined the Hadoop framework for dealing with Big Data in the sections under the Introduction. Several research issues were identified, such as data placement policies, HDFS block size tuning, efficient sort and shuffle mechanisms, and efficient metadata management, and ideas for solving these issues were briefly discussed in Sections III and IV. Our future work is to run experiments on each of the issues individually, in separate papers, and to present the results using performance measuring tools [7].

REFERENCES

Journal Papers:
[1] Wikipedia, "Big Data," https://en.wikipedia.org/wiki/Big_data.
[3] HDFS User Guide.
[4] Map/Reduce Tutorial.
[5] Exploring HDFS Block Placement Policy.
[6] TeraSort.

Books:
[2] T. White, Hadoop: The Definitive Guide, O'Reilly Media / Yahoo Press.

Theses:
[7] P. Garg, "Performance Enhancement of Big Data Processing in Hadoop Map/Reduce," thesis, IIT Bombay.
