Enhancing MapReduce via Asynchronous Data Processing


Marwa Elteir, Heshan Lin and Wu-chun Feng
Department of Computer Science
Virginia Tech
{aelteir, hlin2,

Abstract

The MapReduce programming model simplifies large-scale data processing on commodity clusters by having users specify a map function that processes input key/value pairs to generate intermediate key/value pairs, and a reduce function that merges and converts intermediate key/value pairs into final results. Typical MapReduce implementations such as Hadoop enforce barrier synchronization between the map and reduce phases, i.e., the reduce phase does not start until all map tasks are finished. This synchronization requirement can cause inefficient utilization of computing resources and can adversely impact performance. We therefore present and evaluate two approaches that cope with the synchronization drawback of existing MapReduce implementations. The first approach, hierarchical reduction, starts a reduce task as soon as a predefined number of map tasks completes; it then aggregates the results of different reduce tasks following a tree structure. The second approach, incremental reduction, starts a predefined number of reduce tasks from the beginning and has each reduce task incrementally reduce records collected from map tasks. Together with our performance modeling, we evaluate the different reducing approaches with two real applications on a 32-node cluster. The experimental results show that incremental reduction outperforms hierarchical reduction in general. Incremental reduction also speeds up the original Hadoop implementation by up to 35.33% for the wordcount application and 57.98% for the grep application. In addition, incremental reduction outperforms the original Hadoop in an emulated cloud environment with heterogeneous compute nodes.

Index Terms: Distributed Computing, Cloud Computing, MapReduce, Hadoop, Asynchronous processing.

I. INTRODUCTION

Today, many commercial and scientific applications require the processing of large amounts of data, and thus demand compute resources far beyond what a single commodity processor can provide. To address this, various high-end computing platforms, including supercomputers, graphics processors, clusters of workstations, and cloud computing environments, have been architected to facilitate parallel processing at different scales. However, efficiently exploiting the hardware parallelism provided by these platforms requires parallel programming skills that are difficult for common application developers to master. Even for well-trained experts, the engineering and debugging of parallel applications can be time consuming.

MapReduce [4] is an effort that seeks to democratize parallel programming. Originally proposed by Google, MapReduce has been used to process massive amounts of data in web search engines on a daily basis. With the maturity of Hadoop, an open-source MapReduce implementation, the MapReduce programming model has been widely adopted and found effective in many scientific and commercial application areas such as machine learning [2], bioinformatics [8], astrophysics [7] and cyber-security [5]. Because of its ease of use and built-in fault tolerance, MapReduce is also becoming one of the most effective programming models in cloud computing environments, e.g., Amazon EC2 [1].

A MapReduce job consists of two user-provided functions: map and reduce. As shown in Figure 1, the run-time system creates concurrent instances of map tasks that split and process the input data. The intermediate results generated by map tasks are then copied to the reduce tasks, which in turn reduce the intermediate results and produce the final output. In a MapReduce system, all of the above data is stored as key/value pairs for efficient data indexing and partitioning.

Fig. 1. MapReduce framework.

Existing MapReduce implementations (e.g., Hadoop) enforce a barrier synchronization between the map phase and the reduce phase, i.e., the reduce phase only starts when all map tasks are completed. There are two cases where this barrier synchronization can result in serious resource underutilization. First, in heterogeneous environments, the faster compute nodes finish their assigned map tasks earlier, but these nodes cannot proceed to the reduce processing until all map tasks are finished, thus wasting resources. The resource heterogeneity can originate from different hardware configurations or from resource sharing through virtualization.¹ Second, even in homogeneous environments, a compute node may not be fully utilized by the map processing because a map task alternates between computation and I/O. There is also additional scheduling overhead between the execution of different tasks. In this case, overlapping map and reduce processing can considerably improve job response time, especially when the number of map tasks is large relative to the cluster size. For instance, the number of map tasks of a MapReduce job is typically configured to be 100 times the number of compute nodes at Google [4].

In this paper, we focus on addressing the above synchronization problem for a specific class of MapReduce jobs, called recursively reducible MapReduce jobs. For this type of job, a portion of the map results can be reduced independently, and the partially reduced results can be recursively aggregated to produce the global reduce results. One example of such a job is word counting. More details about recursively reducible MapReduce jobs are discussed in Section II-B.

The unique characteristic of recursively reducible MapReduce jobs is that there is no inherent synchronization requirement between the map phase and the reduce phase. Consequently, we propose and compare two asynchronous data-processing techniques to enhance the resource utilization and performance of MapReduce for recursively reducible jobs. The first approach, hierarchical reduction (HR), overlaps map and reduce processing at the inter-task level. This approach starts a reduce task as soon as a certain number of map tasks complete and aggregates partially reduced results following a tree hierarchy. The second approach, incremental reduction (IR), exploits the potential of overlapping data processing and communication within each reduce task. It starts a designated number of reduce tasks from the beginning and incrementally applies the reduce function to the intermediate results accumulated from map tasks.
We evaluate the proposed approaches with analytical models and experiments on a 32-node cluster. The experimental results demonstrate that both approaches can effectively improve the MapReduce execution time, with the incremental reduction approach consistently outperforming hierarchical reduction. In particular, incremental reduction can outperform the original Hadoop implementation by 35.33% for the wordcount application and 57.98% for the grep application.

The rest of the paper is organized as follows. Section II provides background information about Hadoop and recursively reducible MapReduce jobs. Section III presents the design of the proposed approaches, namely hierarchical reduction and incremental reduction. Section IV evaluates the performance of the proposed approaches using an analytical model. The experimental results are discussed in Section V. Section VI presents the related work. Finally, Section VII concludes the paper.

¹ Zaharia et al. reported that there can be a 2.5-fold performance difference among various virtual machine instances on Amazon EC2 [13].

II. BACKGROUND

Here we describe Hadoop, an open-source Java implementation of the MapReduce framework, as well as the notion of recursively reducible jobs.

A. Hadoop

Hadoop can be logically divided into two subsystems: a distributed file system called HDFS and a MapReduce run-time system. The MapReduce run-time system follows a master-slave design. The master node is responsible for managing submitted jobs, i.e., assigning the map and reduce tasks of every job to the available workers. By default, each worker can run two map tasks and two reduce tasks simultaneously. At the beginning of a job execution, the input data is split and assigned to individual map tasks. When a worker finishes executing a map task, it stores the map results locally as intermediate key/value pairs. The intermediate results of each map task are partitioned and assigned to the reduce tasks according to their keys.
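The key-based partitioning of map output described above can be illustrated with a short Python sketch. It is not Hadoop's code: the function names and the use of a CRC32 hash are assumptions chosen for illustration. The only property that matters is that every pair with the same key lands in the same reduce task's bucket.

```python
import zlib

def partition(key: str, num_reduce_tasks: int) -> int:
    """Pick a reduce task for a key with a deterministic hash (illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def partition_map_output(pairs, num_reduce_tasks):
    """Split one map task's (key, value) pairs into one bucket per reduce task."""
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return buckets

# Hypothetical output of a single word-counting map task.
map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]
buckets = partition_map_output(map_output, num_reduce_tasks=2)
```

Because the bucket index depends only on the key, each reduce task can later fetch its own bucket from every map output and still see all values of the keys it owns.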
A reduce task begins by retrieving its corresponding intermediate results from all map outputs (called the shuffle phase). The reduce task then sorts the collected intermediate results and applies the reduce function to the sorted results. To improve performance, Hadoop overlaps the retrieving and sorting of finished map outputs with the execution of newly scheduled map tasks.

B. Recursively Reducible Jobs

Word counting is a simple example of a recursively reducible job. The occurrences of a word can be counted first on different splits of an input file, and those partial counts can then be aggregated to produce the number of word occurrences in the entire file. Other recursively reducible MapReduce applications include association rule mining, outlier detection, commutative and associative statistical functions, and so on. In contrast, the square of the sum of values is an example of a reduce function that is not recursively reducible, because $(a+b)^2 + (c+d)^2$ does not equal $(a+b+c+d)^2$. However, there are mathematical approaches that can transform such functions so that they benefit from our solution.

It is worth mentioning that typical MapReduce implementations, including Hadoop, provide a combiner function. The combiner function is used to reduce the key/value pairs generated by a single map task. The partially reduced results, instead of the raw map output, are delivered to the reduce tasks for further reduction. Our proposed asynchronous data-processing techniques are applicable to all applications that can benefit from the combiner function. The fundamental difference between our techniques and the combiner function is that our techniques optimize the reducing of key/value pairs from multiple map tasks.

III. ASYNCHRONOUS MAPREDUCE DATA PROCESSING

In this section, we present the design details of our two proposed asynchronous data-processing techniques: hierarchical reduction and incremental reduction.
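The notion of recursive reducibility from Section II-B can be checked with a small Python sketch. The helper names and sample data are hypothetical. The last lines show one simple transformation, assumed here purely for illustration, that makes the square-of-the-sum case fit the pattern: carry the plain sum through the partial reductions and square it once at the end.

```python
from collections import Counter

def count_words(text):
    """Reduce one input split to per-word counts."""
    return Counter(text.split())

def merge_counts(partials):
    """Aggregate partial counts; this is the same operation as the reduce itself."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total

splits = ["a b a", "b c", "a c c"]
partial_counts = [count_words(s) for s in splits]
recursive_result = merge_counts(partial_counts)
direct_result = count_words(" ".join(splits))

def square_of_sum(values):
    """A reduce function that is NOT recursively reducible as written."""
    return sum(values) ** 2

values = [1, 2, 3, 4]
whole = square_of_sum(values)                                   # (1+2+3+4)^2
pieces = square_of_sum(values[:2]) + square_of_sum(values[2:])  # (1+2)^2 + (3+4)^2

# Illustrative transformation: reduce partial sums, apply the square last.
transformed = (sum(values[:2]) + sum(values[2:])) ** 2
```

For word counting the recursive and direct results coincide, whereas `whole` and `pieces` differ for the square of the sum until the squaring step is deferred.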

A. Hierarchical Reduction (HR)

1) Design and Implementation: Hierarchical reduction seeks to overlap the map and reduce processing by dynamically issuing reduce tasks to aggregate partially reduced results along a tree-like hierarchy. As shown in Figure 2, as soon as a certain number of map tasks (defined by the aggregation level $\sigma_H$) are successfully completed, a new reduce task is created and assigned to one of the available workers. This reduce task is responsible for reducing the output of the $\sigma_H$ map tasks that have just finished. When all map tasks are successfully completed and assigned to reduce tasks, another stage of the reduce phase is started. In this stage, as soon as $\sigma_H$ reduce tasks are successfully completed, a new reduce task is created to reduce their output. This process repeats until there is only one remaining reduce task, i.e., when all intermediate results are reduced.

Fig. 2. Hierarchical reduction with an aggregation level of two.

Although conceptually the reduce tasks are organized as a balanced tree, in our implementation a reduce task at a given level does not have to wait for all of the tasks at the previous level to finish. In other words, as soon as a sufficient number of tasks (i.e., $\sigma_H$) from the previous level becomes available, a reduce task from the subsequent level can begin. Such a design can reduce the associated scheduling overhead of HR.

2) Discussion: One advantage of HR is that it can parallelize the reduction of a single reducing key across multiple workers, whereas in the original MapReduce framework the reduction of a key is always handled by one worker. Therefore, this approach is suitable for applications with significant reduce computation per key. However, HR incurs extra communication overhead in transferring the intermediate key/value pairs to reduce tasks at different levels of the tree hierarchy, which can adversely impact performance as the depth of the tree hierarchy increases. Other overheads include the scheduling cost of reduce tasks generated on the fly.

The fault-tolerant design of the original Hadoop needs to be modified to accommodate HR. In particular, the JobTracker should keep track of all created reduce tasks, in addition to the tasks assigned to be reduced by these reduce tasks. Whenever a reduce task fails, another copy of this task should be created, and the appropriate tasks should be assigned again for reduction.

B. Incremental Reduction (IR)

1) Design and Implementation: Incremental reduction aims to start the reduce phase as early as possible within a reduce task. Specifically, the number of reduce tasks is defined at the beginning of the job, as in the original MapReduce framework. Within a reduce task, as soon as a certain amount of map output is received, the reduction of this output starts and the results are stored locally. The same process repeats until all map outputs are retrieved.

Fig. 3. Incremental reduction with a reduce granularity of two.

In Hadoop, a reduce task consists of three stages. The first stage, named shuffling, copies the task's own portion of intermediate results from the output of all map tasks. The second stage, named sorting, sorts and merges the retrieved intermediate results according to their keys. Finally, the third stage applies the reduce function to the values associated with each key. To enhance the performance of the reduce phase, the shuffling stage is overlapped with the sorting stage. More specifically, when the number of in-memory map outputs reaches a certain threshold, mapred.inmem.merge.threshold, these outputs are merged and the results are stored on disk.
When the number of on-disk files reaches another threshold, io.sort.factor, another on-disk merge is performed. After all map outputs are retrieved, all on-disk and in-memory files are merged, and then the reduction stage begins.

In our implementation, we make use of io.sort.factor and mapred.inmem.merge.threshold. When the number of in-memory outputs reaches the mapred.inmem.merge.threshold threshold, they are merged and the merging results are stored on disk. When the number of on-disk outputs reaches the io.sort.factor threshold, the incremental reduction of these outputs begins, and the reducing results are stored instead of the merging results. When all map outputs are retrieved, the in-memory map outputs are reduced along with the stored reducing results. The final output data is written to the distributed file system. The entire process is depicted in Figure 3.

2) Discussion: IR incurs less overhead than HR for two reasons. First, the intermediate key/value pairs are transmitted once from the map tasks to the reduce tasks instead of several times along the hierarchy. Second, all of the reduce tasks are created at the start of the job, and hence the scheduling overhead is reduced.

Besides the communication cost, the number of writes to the local and distributed file systems is the same (assuming the same number of reduce tasks) for both MR and IR. Therefore, IR can outperform MR when there is sufficient overlap between the map and reduce processing.

The main challenge of IR is to select the right threshold for triggering an incremental reduce operation. Too low a threshold results in unnecessarily frequent I/O operations, while too high a threshold cannot deliver noticeable performance improvements. Interestingly, a similar decision, i.e., the merging threshold, has to be made in the original Hadoop implementation as well. Currently we provide a runtime option for users to control the incremental reduction threshold. In the future, we plan to investigate self-tuning of this threshold for long-running MapReduce jobs.

It is worth noting that since the map and reduce tasks in this approach are created in the same manner as in Hadoop, the fault-tolerance scheme of Hadoop works in IR as well.

IV. ANALYTICAL MODELS

Here we derive analytical models to compare the performance of the original MapReduce (MR) implementation of Hadoop with the augmented implementations that add the hierarchical reduction (HR) and incremental reduction (IR) enhancements. Table I presents all the parameters used in the models.

Without loss of generality, our modeling assumes that the number of reduce tasks $r$ is smaller than the number of execution slots $2n$. In fact, the Hadoop documentation recommends 95% of the execution slots as a good number of reduce tasks for typical applications. However, our analysis can be easily generalized to model the cases where there are more reduce tasks than execution slots.
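As an aside to the modeling discussion, the two reduction schemes of Section III can be contrasted in a small Python sketch. This is a single-process illustration under simplifying assumptions, not the Hadoop-based implementation: map outputs are modeled as `Counter` objects, the function names are hypothetical, `threshold` stands in for the on-disk trigger, and the hierarchical version proceeds level by level, whereas the design described earlier lets a reduce task start as soon as $\sigma_H$ inputs are available.

```python
from collections import Counter
from typing import Iterable, List

def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """A recursively reducible reduce function: sum the counts per key."""
    out = Counter()
    for partial in partials:
        out.update(partial)
    return out

def hierarchical_reduce(map_outputs: List[Counter], sigma_h: int) -> Counter:
    """Reduce groups of sigma_h results level by level until one remains."""
    if sigma_h < 2:
        raise ValueError("aggregation level must be at least 2")
    level = list(map_outputs)
    while len(level) > 1:
        level = [reduce_counts(level[i:i + sigma_h])
                 for i in range(0, len(level), sigma_h)]
    return level[0] if level else Counter()

def incremental_reduce(map_outputs: Iterable[Counter], threshold: int) -> Counter:
    """Reduce buffered outputs whenever `threshold` of them have arrived."""
    partial = Counter()   # stored result of the earlier incremental reductions
    buffered = []         # map outputs received since the last reduction
    for output in map_outputs:  # outputs arrive as map tasks finish
        buffered.append(output)
        if len(buffered) >= threshold:
            partial = reduce_counts([partial] + buffered)
            buffered = []
    return reduce_counts([partial] + buffered)  # final pass over the remainder

# Hypothetical outputs of five word-counting map tasks.
map_outputs = [Counter(s.split()) for s in ["a b", "a", "b c", "c c", "a b c"]]
one_shot = reduce_counts(map_outputs)
via_hr = hierarchical_reduce(map_outputs, sigma_h=2)
via_ir = incremental_reduce(map_outputs, threshold=2)
```

Both schemes arrive at the same result as the one-shot reduce because the reduce function is recursively reducible; they differ only in when the work is done and how often partial results are moved or stored.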
Furthermore, the model assumes that the number of map tasks $m$ is greater than the number of available execution slots in the cluster, $2n$ (recall that there are two execution slots per node by default). When the number of map tasks is less than the number of execution slots, all map tasks are executed in parallel and complete simultaneously, so there is no way to overlap the map and reduce phases.

For simplicity, we consider two different cases. The first case assumes a high degree of overlap between the map and reduce procedures, where the processing (including retrieving, merging, and sorting) of almost all intermediate results in a reduce task can be overlapped with map computations. The second case, representing a low degree of overlap between the map and reduce procedures, assumes that the processing of only a portion of the intermediate results in a reduce task can be overlapped with map computations.

TABLE I. PARAMETERS USED IN THE PERFORMANCE MODEL

Parameter     Meaning
$m$           Number of map tasks
$n$           Number of nodes
$k$           Total number of intermediate key/value pairs
$r$           Number of reduce tasks of the MR framework
$t_m$         Average map task execution time
$t_{rk}$      Average execution time of reducing the values of a single key
$\sigma_H$    Aggregation level used in HR
$C$           Communication cost per key/value pair
$C_{MR}$      Communication cost from $m$ map tasks to $r$ reduce tasks in MR
$C_{HR}$      Communication cost from the assigned $\sigma_H$ map tasks to a reduce task in HR

A. MR

The execution of the original Hadoop is illustrated by the left part of Figure 4. When the overlapping degree is high, the merging phase of MR can be eliminated. So, the total time becomes Map time + Reduce time, assuming the final-stage merging is neglected.

Fig. 4. Execution of MR and IR.

For a low degree of overlap, if the reduce tasks occupy all nodes, i.e., the number of reduce tasks ($r$) is larger than or equal to $n$, only a portion of the merging of the intermediate results can be overlapped with the map computation. The total time can then be expressed by Equation (1), where $o$ represents the reduction in the merging time.

$$T_{MR} = \text{Map time} + (\text{Merge time} - o) + \text{Reduce time} \qquad (1)$$

However, when the reduce tasks do not occupy all nodes, more merging can be overlapped with the map computations due to the load balancing effect, i.e., the nodes with running reduce tasks process a smaller number of map tasks than the other nodes. This is because Hadoop uses greedy task scheduling: the nodes without running reduce tasks process map tasks faster and, in turn, are assigned more map tasks. As a result, the Map time increases and the merging time decreases, as shown in Equation (2), where $l$ represents the load balancing effect, $o'$ represents the overlapping effect, and $o' > o$. As $r$ increases, $l$ and $o'$ keep decreasing until they reach $0$ and $o$, respectively, when $r = n$ (Equation (1)).

$$T_{MR} = (\text{Map time} + l) + (\text{Merge time} - o') + \text{Reduce time} \qquad (2)$$

B. HR

For hierarchical reduction, map and reduce processing at different stages can be overlapped, as shown in Figure 5. To compare the performance of HR with that of MR, we consider more detailed modeling. For HR, when all map tasks are finished, the remaining computations are to reduce the un-reduced map tasks and to combine the results of this reducing stage with the other partially reduced results. Specifically, the total execution times of MR and HR can be represented by the following equations, where $C_{MR}$ is the communication required to retrieve the remaining map outputs (since MR's communication is overlapped with the map computations), and $s$ is the remaining number of reduce stages in HR's hierarchy:

$$T_{MR} = \text{Map time} + C_{MR} + \frac{k}{r}\log\left(\frac{k}{r}\right) + t_{rk}\,\frac{k}{r} \qquad (3)$$

$$T_{HR} = \text{Map time} + \left(C_{HR} + \frac{\sigma_H k}{m}\log\left(\frac{\sigma_H k}{m}\right) + t_{rk}\,\frac{\sigma_H k}{m}\right) \times s \qquad (4)$$

Fig. 5. Execution of the HR framework when $m = 8n$.

Assuming every map task produces a value for each given key, $C_{MR}$ is $\frac{k}{r}\cdot\frac{2n}{m}\,C$ and $C_{HR}$ is $\frac{\sigma_H k}{m}\,C$, so $C_{HR}$ equals $\frac{\sigma_H r}{2n}\,C_{MR}$, where $C$ is the communication cost per key/value pair. Substituting this value for $C_{HR}$ in Equation (4) gives:

$$T_{HR} = \text{Map time} + \left(\frac{\sigma_H r}{2n}\,C_{MR} + \frac{\sigma_H}{m}\left(k\log\left(\frac{\sigma_H k}{m}\right) + k\,t_{rk}\right)\right) \times s \qquad (5)$$

When the overlapping degree is high, $s$ can be replaced by $(\log_{\sigma_H}(2n) + 1)$ in Equation (5), which represents the reduction of the map tasks of the final stage. Moreover, the merging time can be eliminated from Equation (3). So, for significantly large $m$, the reducing part of Equation (5) is smaller than that of Equation (3). If this term occupies a significant portion of the total time of MR, and the communication overhead is small, then HR will behave better than MR, as we will see in the experimental evaluation. However, when the overlapping degree is low, the hierarchy can be very deep (large $s$), and the performance of HR can be worse than that of MR.

C. IR

The execution of IR relative to MR is represented by Figure 4.
In particular, when the overlapping degree is high, the final merging and reducing phases of IR can be eliminated, so the total execution time can be represented by the Map time alone. Comparing this to MR, we conclude that IR behaves better than MR, especially when the reducing time of MR is significant.

For a low overlapping degree, if $r$ is larger than $n$, then the merging and reducing of only a portion of the intermediate results can be overlapped with the map computations, and the total time can be expressed by Equation (6), where $o_m$ and $o_r$ represent the reduction in the merging and reducing times, respectively.

$$T_{IR} = \text{Map time} + (\text{Merge time} - o_m) + (\text{Reduce time} - o_r) \qquad (6)$$

To compare this with Equation (2), we consider the details of the overlapping computations in MR and IR. The main difference is that IR performs reducing after merging and writes the results of the reduce, rather than the results of the merge, to disk. Assuming that the reduce function changes the size of the input by a factor of $x$ and that the reduce function is linear, the overlapped computations of MR ($O_{MR}$) and IR ($O_{IR}$) can be represented by the following equations, where $I$ is the size of the intermediate key/value pairs to be merged and reduced during the map phase, $I \log I$ is the average number of compare operations executed during the merge, $p_s$ is the processor speed, and $d_s$ is the disk write speed.

$$O_{MR} = \frac{I \log I}{p_s} + \frac{I}{d_s} \qquad (7)$$

$$O_{IR} = \frac{I \log I + I}{p_s} + \frac{I\,x}{d_s} \qquad (8)$$

Given the same overlapping degree, if $x < 1$, which is valid for applications like wordcount, grep, and linear regression, then IR is able to conduct more merging, in addition to reducing, overlapped with the map computations. So, the merging and reducing terms in Equation (6) are less than the same terms in Equation (1), and IR can behave better than MR provided the reduce computation is significant, as illustrated by Figure 4. On the other hand, if $x \geq 1$, as in the sort application, then the performance of IR depends heavily on the complexity of the reduce function compared to the merging, in addition to the size of the intermediate key/value pairs.
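The models of this section can be explored numerically with a short Python sketch. It evaluates the post-map terms of Equations (3) and (5), the latter with the high-overlap stage count $s = \log_{\sigma_H}(2n) + 1$, together with Equations (7) and (8). Several things here are assumptions rather than part of the model: the logarithms are taken as base 2, the function names are hypothetical, and all parameter values are arbitrary placeholders rather than measurements.

```python
import math

def mr_tail(k, r, t_rk, c_mr):
    """Post-map terms of Eq. (3): C_MR + (k/r) log(k/r) + t_rk (k/r)."""
    per_task = k / r
    return c_mr + per_task * math.log2(per_task) + t_rk * per_task

def hr_tail(k, m, n, r, t_rk, c_mr, sigma_h):
    """Post-map terms of Eq. (5) with s = log_{sigma_h}(2n) + 1 (high overlap)."""
    s = math.log(2 * n, sigma_h) + 1
    per_task = sigma_h * k / m              # pairs handled by one HR reduce task
    c_hr = sigma_h * r / (2 * n) * c_mr     # C_HR expressed through C_MR
    return (c_hr + per_task * math.log2(per_task) + t_rk * per_task) * s

def overlapped_mr(i, p_s, d_s):
    """Eq. (7): merge I pairs, then write the merged data to disk."""
    return i * math.log2(i) / p_s + i / d_s

def overlapped_ir(i, x, p_s, d_s):
    """Eq. (8): merge and linearly reduce I pairs, then write I*x to disk."""
    return (i * math.log2(i) + i) / p_s + i * x / d_s

# Placeholder parameters (arbitrary units), chosen only to exercise the formulas.
params = dict(k=1_000_000, r=4, t_rk=1e-6, c_mr=1.0)
mr = mr_tail(**params)
hr = hr_tail(m=1024, n=32, sigma_h=4, **params)

i, p_s, d_s, x = 1_000_000, 1e9, 5e7, 0.1
gap = overlapped_ir(i, x, p_s, d_s) - overlapped_mr(i, p_s, d_s)
```

Subtracting Equation (7) from Equation (8) leaves $I/p_s - I(1-x)/d_s$, so under these formulas the comparison between the two overlapped workloads hinges on how the extra linear reduce pass compares with the disk writes saved when $x < 1$.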
Applying the previous analysis to the case where $r < n$, we conclude that IR can behave better than MR in this case as well, provided that the reducing time occupies a significant portion of the total execution time.

V. EXPERIMENTAL ANALYSIS

In this section, we present a rigorous performance evaluation of our proposed techniques. The details of the system configurations are given along with the conducted experiments.

A. Experimental Platform

We ran our experiments on System X at Virginia Tech, which comprises Apple Xserve G5 compute nodes with dual 2.3 GHz PowerPC 970FX processors, 4 GB of RAM, and 80 GB hard drives. The compute nodes are connected with a Gigabit Ethernet interconnect. Each node runs the GNU/Linux operating system with kernel version [...]. The proposed approaches are developed based on Hadoop 0.17.0.

B. Overview

We use two applications, wordcount and grep, in the experiments. Wordcount parses one or more documents and produces, for every word, the number of its occurrences. Grep accepts one or more documents and an expression as input, matches this expression against the documents, and produces the number of occurrences of every match.

We first study the scalability of the different reducing approaches in terms of the dataset size. Next, we seek to understand the reasons for the performance differences between the approaches by profiling the behavior of wordcount and grep under various configurations. Finally, we evaluate the reducing approaches on emulated cloud environments.

The major performance metric in all experiments is the total execution time in seconds. For a fair comparison, we follow the guidance given in the Hadoop documentation regarding the number of map and reduce slots per node as well as the thresholds that control the frequency of merging and sorting intermediate results (i.e., mapred.inmem.merge.threshold and io.sort.factor, as discussed in Section III-B). Furthermore, we enable the combiner function in all experiments. In addition, the aggregation level of the hierarchical reduction approach is set to 4, which produces the best results on the 32-node cluster.

Before executing an experiment, we flush the Linux file system cache by having each node read a dummy file that is larger than the size of the memory. This avoids the I/O performance inconsistency caused by Linux file caching.

C. Scalability with the Dataset Size

In this experiment, we study the scalability of the three reducing approaches with regard to the size of the input dataset. We run wordcount and grep using datasets of two different sizes, i.e., 16 GB and 64 GB. For wordcount, the number of reduce tasks was set to 4 and 8 (a broader range is used in Section V-D). For grep, we use a query that produces results of moderate size. The performance of queries generating various sizes of output is investigated in Section V-E.

As shown in Figure 6, the performance improvement of IR over MR generally increases as the size of the input dataset increases. Specifically, for wordcount, as the size of the input increases to 64 GB, IR outperforms MR by 34.5% and 9.48%, instead of 21% and 5.97% for the 16 GB input, using 4 and 8 reduce tasks, respectively. In addition, for grep, increasing the input size to 64 GB improves IR's performance gain over MR; IR is better than MR by 16.2% (for the 64 GB input) instead of 7.1% (for the 16 GB input).

Fig. 6. Scalability with dataset size using wordcount and grep (normalized execution time for 16 GB and 64 GB inputs; the grep query is [a-c]+[a-z.]+).

The scalability of IR is attributed to two factors that provide more room for overlapping map processing and reduce computations. First, as the dataset size increases, the map phase has to perform more I/O. Second, for 64 GB, the number of map tasks increases from 256 to 1024, thus increasing the scheduling overhead.

Similarly, the performance improvement of HR over MR increases by 9.96% for wordcount as the size of the input dataset increases. However, the speedup of HR over MR decreases (by 4.13%) for grep when the larger input is used. This is because the extra communication cost caused by the increase in input data can be compensated by the corresponding increase in the map processing time in wordcount but not in grep.

Since typical MapReduce jobs process large amounts of input data, all of the subsequent experiments use input datasets of 64 GB.

D. Wordcount Performance

In a cluster of 32 nodes, we run wordcount on a dataset of size 64 GB. The number of map tasks is set to 1024, and the number of reduce tasks is varied from 1 to 64. As shown in Figure 7, as the number of reduce tasks increases, IR's improvement over MR decreases. Specifically, for one reduce task, IR behaves better than MR by 35.33%. When the number of reduce tasks is increased to 4, IR behaves better by 34.49%. As the number of reduce tasks increases, the processing time of a reduce task decreases, thus providing little room for overlapping the map and reduce processing. Specifically, when the number of reduce tasks is 32, a reduce task consumes a mere 6.83% of the total execution time, as shown in Table II.

Fig. 7. Performance of MR vs. IR using wordcount (total execution time in seconds against the number of reduce tasks; HR is shown for several aggregation levels).
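The relation between input size and the number of map tasks used in these experiments can be reproduced with simple arithmetic. The Python sketch below assumes one map task per fixed-size input split and a split size of 64 MB; the split size is an assumption that is not stated in the text, although it is consistent with the 256 map tasks reported for the 16 GB input. The two map slots per node follow the Hadoop default mentioned in Section II-A.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)

def num_map_tasks(input_bytes: int, split_bytes: int = 64 * MIB) -> int:
    """Number of map tasks, assuming one task per fixed-size input split."""
    return ceil_div(input_bytes, split_bytes)

def num_map_waves(map_tasks: int, nodes: int, slots_per_node: int = 2) -> int:
    """Rounds of map tasks needed when every map slot is busy in each round."""
    return ceil_div(map_tasks, nodes * slots_per_node)

tasks_16gb = num_map_tasks(16 * GIB)
tasks_64gb = num_map_tasks(64 * GIB)
waves_64gb = num_map_waves(tasks_64gb, nodes=32)
```

A fourfold larger input therefore yields four times as many map tasks, and with 64 map slots in the cluster the larger job needs many successive waves of map tasks, which is what leaves room for reduce-side work to overlap with the map phase.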

TABLE II. MR AND IR EXECUTION PROFILE (columns: reduce tasks, incremental merges, reduce time, map time).

TABLE III. NUMBER OF MAP TASKS EXECUTED ON A NODE WITH REDUCE TASK(S) RUNNING. S MEANS A SPECULATIVE TASK.

Furthermore, IR achieves its best performance at 4 reduce tasks because this provides the best compromise between the level of parallelism, which is controlled by the number of reduce tasks, and the overlap of map with reduce. Specifically, in this case, IR conducts 55 incremental merges overlapped with the map computations, compared to [...] merges with 32 reduce tasks, as shown in Table II. As a result, the nodes executing a reduce task execute 26 map tasks instead of the 32 map tasks in the case of 32 reduce tasks; recall the load balancing effect discussed in Section IV. Specifically, when only a small portion of the nodes are executing reduce tasks, these nodes have fewer resources to spend on the map processing, and thus more map tasks are pushed off to the other nodes without reduce tasks.

The best performance is achieved at 32 and 4 reduce tasks for MR and IR, respectively. With the best performance for both MR and IR, IR is better by 5.86%. On the other hand, using an aggregation level of 4, HR behaves better than MR with 8 reduce tasks by 5.13%. We changed the aggregation level from 2 to 8, as shown in Figure 7. The best performance is achieved at aggregation level 4, which gives the best balance between the overlap of map and reduce processing and the reduce overhead. For example, with an aggregation level of 2, a reduce operation can be triggered sooner, but the reducing hierarchy is also deeper.

Fig. 8. CPU utilization throughout the whole job using wordcount (overall CPU utilization against the number of reduce tasks).

To better understand the benefits of the incremental reduction approach, we measured the CPU utilization throughout the job execution and the number of disk transfers per second during the map phase for both MR and IR. As shown in Figures 8 and 9, the CPU utilization of IR is greater than that of MR by 5% on average. In addition, the average number of disk transfers of IR is less than that of MR by 2.95 transfers per second. This is attributed to the smaller amount of data written to disk by IR, because it reduces the intermediate data before writing it back to disk. In doing so, IR also reduces the size of the data read from disk at the final merging and reducing stage.

Fig. 9. Number of disk transfers per second through the map phase using wordcount (transfers per second against the number of reduce tasks).

To conclude, for any number of reduce tasks, IR achieves performance that is better than or equal to that of MR. Moreover, the best performance for IR is achieved using only 4 reduce tasks, which means that IR is more efficient in utilizing the available resources. We therefore expect IR to achieve better performance when several jobs are running at the same time, or with larger amounts of reduce processing. In particular, when running three concurrent wordcount jobs, the best configuration of IR behaves better than the best configuration of MR by 8.1% instead of 5.86%, as shown in Table IV.

TABLE IV. MR AND IR PERFORMANCE WITH CONCURRENT JOBS (columns: concurrent jobs and the execution time in seconds of each approach).

E. Grep Performance

In a cluster of 32 nodes, we run grep on a dataset of 64 GB. The number of map tasks is set to 1024, and the number of reduce tasks is set to the default value, i.e., one. Grep runs two consecutive jobs: the first returns, for each match, the number of its occurrences, and the second is a short job that inverts the output of the first so that the final output is sorted by the number of occurrences of the matches instead of alphabetically. In this experiment, we focus on the first, longer job.

We used five different queries, each producing a different number of matches and hence different numbers of intermediate and final key/value pairs. As shown in Figure 10, IR and HR deliver very similar performance. In addition, for the first query, all reducing approaches have the same performance.
For subsequent queries, HR and IR outperform MR, and the performance improvement of HR/IR over MR increases along with the number of matches.
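
As a rough illustration of the first grep job described above, the following Python sketch emits one (match, 1) pair per regular-expression match in the map step and sums the counts per match in the reduce step; a short second step then orders the matches by occurrence. This is not Hadoop's grep code: the function names `grep_map`, `grep_reduce`, and `sort_by_occurrence` and the sample input are hypothetical, and everything runs in a single process.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Pattern, Tuple


def grep_map(lines: Iterable[str], pattern: Pattern[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (match, 1) for every regex match in an input split."""
    for line in lines:
        for match in pattern.findall(line):
            yield match, 1


def grep_reduce(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the counts of each distinct match."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def sort_by_occurrence(totals: Counter) -> List[Tuple[str, int]]:
    """Second, short job: order matches by occurrence instead of alphabetically.

    Ties are broken alphabetically (an assumption made here for determinism).
    """
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical input split and two of the query shapes used in the experiment.
SAMPLE_SPLIT = ["aab dog", "aab cat ab."]
NARROW_QUERY = re.compile(r"a+[a-z.]+")
BROAD_QUERY = re.compile(r"[a-i]+[a-z.]+")

narrow_pairs = list(grep_map(SAMPLE_SPLIT, NARROW_QUERY))
broad_pairs = list(grep_map(SAMPLE_SPLIT, BROAD_QUERY))
narrow_result = sort_by_occurrence(grep_reduce(narrow_pairs))
```

With the sample lines, the broader query `[a-i]+[a-z.]+` emits more intermediate pairs than `a+[a-z.]+`, which is the effect that makes the reduce-side work grow along with the number of matches.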

Fig. 10. Performance of MR, IR, and HR using grep (x-axis: query, one of 'a+[a-z.]+', '[a-b]+[a-z.]+', '[a-c]+[a-z.]+', '[a-d]+[a-z.]+', '[a-i]+[a-z.]+'; y-axis: total execution time in seconds).

TABLE V. CHARACTERISTICS OF DIFFERENT QUERIES. Columns: Query, Reduce Time (seconds), Intermediate Data Size (records). Rows: a+[a-z.]+, [a-b]+[a-z.]+, [a-c]+[a-z.]+, [a-d]+[a-z.]+, [a-i]+[a-z.]+.

Fig. 11. Wordcount performance in homogeneous and heterogeneous environments (x-axis: setting; y-axis: total execution time in seconds). In the second setting, 10 nodes are slowed down, and the reduce tasks are scheduled on the fast nodes. Currently, where reduce tasks will be run in HR is not controllable.

F. Heterogeneous Environment Performance

Nowadays, data centers are becoming increasingly heterogeneous, due to the use of virtualization and/or machines from different generations. In this experiment, we aim at studying the robustness of MR, HR, and IR to the heterogeneity of the target cluster. In a cluster of 32 nodes, we manually slow down several nodes, i.e., 10 nodes, to mimic a heterogeneous cluster. To slow down a given node, we continuously run the dd command to convert and write a large file (e.g., 5.7 GB) to disk. This approach was used in a related study by Zaharia et al. [13].

We expect that in these environments, the map phase takes longer due to the effects of the slow nodes. So, if the reduce tasks are appropriately assigned to the fast nodes, then utilizing the extra map time for reduce computations could improve the performance of the proposed approaches. Using wordcount, we run MR and IR with the best configurations achieved in Section V-D, i.e., 32 reduce tasks for MR and 4 reduce tasks for IR. As shown in Figure 11, when the reduce tasks are assigned to the fast nodes, IR becomes better than MR by 1.33% instead of 5.86%. However, when they are randomly assigned, IR becomes better than MR by only 2.32%. This is expected because the I/O and computing resources available for reduce tasks in this case are limited, preventing IR from taking advantage of the overlap between the map and reduce processing. We argue that if the heterogeneity originates from different generations of hardware or from virtualization, it is possible to identify the fast nodes and assign more reduce tasks to these nodes.

Note that HR's performance drops significantly when running in the heterogeneous environment. This can be attributed to the large number of reduce tasks generated in HR. In addition, it is nondeterministic where these tasks will be run, so it is not possible to avoid the effect of the slow nodes.

To simulate a cloud environment with slower I/O performance because of virtualization, we slow down 32 nodes and repeat the experiment. As we can see, IR still outperforms MR by 7.14%.

VI. RELATED WORK

Several research efforts have sought to enhance the MapReduce framework. Sawzall is a programming language built on top of MapReduce [9]. It aims at automatically analyzing huge distributed data files. The main difference between Sawzall and the standalone MapReduce framework is that Sawzall distributes the reduction in a hierarchical, topology-based manner.

In [10], the authors improved the performance of Google's MapReduce framework by pipelining disk and network I/O. In particular, they aimed at streaming intermediate data as it is generated and using local storage as a write-ahead log for network transfer.

In [12], the authors argue that the original MapReduce framework is limited in supporting applications such as relational data processing. So they presented a modified version of MapReduce named Map-Reduce-Merge, where the reduce workers produce a list of key/value pairs that are transmitted to a set of merge workers for further processing.

Moreover, Valvag et al. developed a high-level declarative programming model and its underlying runtime, Oivos, which aims at handling applications that require running several MapReduce jobs [11].
This framework has two main advantages compared with MapReduce. First, it reduces the overheads associated with this type of application, which include monitoring the status and progress of each job, determining when to re-execute a failed job or start the next one, and specifying a valid execution order for the MapReduce jobs. Second, it removes the extra synchronization imposed when these applications are executed using the traditional MapReduce framework, i.e., every reduce task in one job should complete before any of the map tasks in the next job can start.

Ko et al. realized that the loss of intermediate map outputs may result in significant performance degradation [6]. Although using HDFS (Hadoop Distributed File System) improves reliability, it considerably increases the job completion time. As a result, they proposed some design ideas for a new intermediate data storage system.

Zaharia et al. [13] proposed a speculative task scheduler named LATE (Longest Approximate Time to End) to cope with several limitations of the original Hadoop scheduler in heterogeneous environments such as Amazon EC2 [1].

Finally, Condie et al. [3] recently extended the MapReduce architecture to work efficiently for online jobs in addition to batch jobs. Instead of materializing the intermediate key/value pairs within every map task, they proposed pipelining these data directly to the reduce tasks. They further extended this pipelined MapReduce to support interactive data analysis through online aggregation and continuous query processing.

To conclude, our proposed solutions are orthogonal to the above extensions to the MapReduce framework and can complement their improvements.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we designed and implemented two approaches to reduce the overhead of the barrier synchronization between the map and reduce phases of Hadoop, an open-source implementation of MapReduce. In addition, we evaluated the performance of these approaches against Hadoop using an analytical model and experiments on a 32-node cluster.

The first proposed approach is hierarchical reduction, which overlaps map and reduce processing at the inter-task level. It starts a reduce task as soon as a certain number of map tasks complete and aggregates partially reduced results following a tree hierarchy. This approach can be effective when there is enough overlap between map and reduce processing. However, it has some limitations due to the overhead of creating reduce tasks on the fly, in addition to the extra communication cost of transferring the intermediate results along the tree hierarchy.

To cope with these overheads, we proposed the incremental reduction approach, where all reduce tasks are created at the start of the job, and every reduce task incrementally reduces the received map outputs.
The experimental results have shown that this approach consistently outperforms the hierarchical reduction approach and the original Hadoop implementation.

In the future, we plan to evaluate our techniques on a broader class of applications. We will also investigate generalizing our techniques to support applications that are not recursively reducible in their original form. Finally, we will evaluate the performance of the proposed techniques in real cloud environments.

ACKNOWLEDGEMENT

This research is supported in part by NSF grant CNS.

REFERENCES

[1] Amazon.com. Amazon Elastic Compute Cloud.
[2] S. Chen and S. Schlosser. Map-reduce meets wider varieties of applications. Technical Report IRP-TR-08-05, Intel Research, 2008.
[3] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce online. Technical Report UCB/EECS, University of California at Berkeley, 2009.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In 6th Symposium on Operating Systems Design & Implementation (OSDI), 2004.
[5] M. Grant, S. Sehrish, J. Bent, and J. Wang. Introducing map-reduce to high end computing. 3rd Petascale Data Storage Workshop, Nov. 2008.
[6] S. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of intermediate data in cloud computations. In 12th Workshop on Hot Topics in Operating Systems (HotOS XII), 2009.
[7] G. Mackey, S. Sehrish, J. Bent, J. Lopez, S. Habib, and J. Wang. Introducing map-reduce to high end computing. In Petascale Data Storage Workshop, held in conjunction with SC08, 2008.
[8] A. Matsunaga, M. Tsugawa, and J. Fortes. CloudBLAST: Combining MapReduce and virtualization on distributed resources for bioinformatics. Microsoft eScience Workshop, 2008.
[9] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13(4), 2005.
[10] Alex Rasmussen and Maximus Daehan Choi. Improving performance in MapReduce through pipelining. 2006.
[11] S. V. Valvag and D. Johansen. Oivos: Simple and efficient distributed data processing. In 10th IEEE International Conference on High Performance Computing and Communications (HPCC '08), Sept. 2008.
[12] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Map-reduce-merge: Simplified relational data processing on large clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2007. ACM.
[13] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI: USENIX Symposium on Operating Systems Design and Implementation, 2008.

DISCLAIMER

This paper is based on version 0.17.0 of Hadoop. We completed the major design and development of our approaches in June 2009. Recently, we found that, concurrently with our project development, Hadoop extended the implementation of the original combiner function using a design similar to the incremental reduction (IR) approach proposed in our paper. In the most recent version of Hadoop, the combiner function is triggered in a reduce task during the in-memory merge of map outputs. In our design, the incremental reduction occurs at both the in-memory merge and the on-disk merge. We also present an alternative design, i.e., hierarchical reduction. Finally, in addition to our proposed designs, we presented a rigorous performance evaluation and analytical analysis of incremental reduction (IR) and hierarchical reduction (HR).
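
To make the idea of reducing at both the in-memory merge and the on-disk merge more concrete, here is a minimal single-process Python sketch of a reduce task for a wordcount-style (recursively reducible) job. It is an illustration under stated assumptions, not the Hadoop implementation: the class `IncrementalReducer`, its thresholds `memory_limit` and `max_spills`, and the in-memory list standing in for spill files are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


class IncrementalReducer:
    """Toy reduce task that applies the reduce function at every merge.

    memory_limit and max_spills are hypothetical thresholds chosen for
    illustration; they do not correspond to real Hadoop settings.
    """

    def __init__(self, memory_limit: int = 3, max_spills: int = 2) -> None:
        self.memory_limit = memory_limit  # map outputs buffered before an in-memory merge
        self.max_spills = max_spills      # spill "files" tolerated before an on-disk merge
        self.buffer: List[Counter] = []   # map outputs currently held in memory
        self.spills: List[Counter] = []   # stand-in for reduced spill files on disk
        self.merges = 0                   # incremental merges performed so far

    @staticmethod
    def _reduce(parts: Iterable[Counter]) -> Counter:
        """Sum counts per key; valid because summation is recursively reducible."""
        total: Counter = Counter()
        for part in parts:
            total.update(part)
        return total

    def receive(self, map_output: Dict[str, int]) -> None:
        """Accept one map output; merge in memory once the buffer is full."""
        self.buffer.append(Counter(map_output))
        if len(self.buffer) >= self.memory_limit:
            self._in_memory_merge()

    def _in_memory_merge(self) -> None:
        # Reduce the buffered outputs before "writing" them out as one spill.
        self.spills.append(self._reduce(self.buffer))
        self.buffer = []
        self.merges += 1
        if len(self.spills) > self.max_spills:
            self._on_disk_merge()

    def _on_disk_merge(self) -> None:
        # Reduce again while merging spills, so less data is read at the end.
        self.spills = [self._reduce(self.spills)]
        self.merges += 1

    def finish(self) -> Dict[str, int]:
        """Final reduce over whatever remains on disk and in memory."""
        return dict(self._reduce(self.spills + self.buffer))


reducer = IncrementalReducer(memory_limit=3, max_spills=2)
for _ in range(10):  # ten map tasks, each reporting the same partial counts
    reducer.receive({"a": 1, "b": 2})
final_counts = reducer.finish()
```

Because each merge already applies the reduce function, the final step only has to combine a few partially reduced results, which mirrors the reduction in intermediate data written to and read from disk that the evaluation reports for IR.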