Pig vs Hive: Benchmarking High Level Query Languages


Benjamin Jakobus, IBM, Ireland
Dr. Peter McBrien, Imperial College London, UK

Abstract

This article presents benchmarking results (which stem from 2013, when the author spent a year at Imperial College London) of two benchmark sets, run on small clusters of 6 and 9 nodes, applied to Hive and Pig running on Hadoop. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline against which to compare major Pig Latin releases). The second set was obtained by applying the TPC-H benchmarks. The two benchmarks showed conflicting results: the first indicated that Pig outperformed Hive on most operations, yet, interestingly, the TPC-H results provide evidence that Hive is significantly faster than Pig. The article analyzes the two benchmarks and concludes with a set of differences and a justification of the results. It presumes that the reader has a basic knowledge of Hadoop and big data; it is not intended as an introduction to Hadoop, Pig or Hive.

About the authors

Benjamin Jakobus graduated with a BSc in Computer Science from University College Cork in 2011, after which he co-founded an Irish start-up. He returned to university one year later and graduated with an MSc in Advanced Computing from Imperial College London. Since graduating, he has taken up a position as a Software Engineer at IBM Dublin (SWG, Collaboration Solutions). This article is based on his Masters thesis, developed under the supervision of Dr. Peter McBrien.

Dr. Peter McBrien graduated with a BA in Computer Science from Cambridge University. After some time working at Racal and ICL, he joined the Department of Computing at Imperial College as an RA in 1989, working on the Tempora Esprit Project. He obtained his PhD, Implementing Graph Rewriting By Graph Rewriting, in 1992, under the supervision of Chris Hankin. In 1994, he joined the Department of Computing at King's College London as a lecturer, and returned to the Department of Computing at Imperial College in August 1999 as a lecturer. Since then he has been promoted to Senior Lecturer.

Acknowledgements

The authors would like to thank Yu Liu, PhD student at Imperial College London, who, over the course of the past year, helped us with any technical problems that we encountered.

1 Introduction

Despite Hadoop's popularity, users find it cumbersome to develop Map-Reduce (MR) programs directly. To simplify the task, high-level scripting languages such as Pig Latin and HiveQL have emerged. Users are therefore often faced with the question of whether to use Pig or Hive. At the time of writing, no up-to-date scientific studies exist to help them answer this question. In addition, the performance differences between Pig and Hive are not well understood, and literature examining these differences is scarce.

This article presents benchmarking results (which stem from 2013, whilst the author spent a year at Imperial College London) of two benchmark sets, run on small clusters of 6 and 9 nodes, applied to Hive and Pig running on Hadoop. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07. The second set was obtained by applying the TPC-H benchmarks. The TPC-H test cases consist of 22 distinct queries, each of which exhibits the same (or a higher) degree of complexity than is typically found in real-world industry scenarios, uses varying query parameters and exercises various types of access.

Whilst existing literature[7][10][6][4][11][13][8][12] addresses some of these questions, it suffers from the following shortcomings:

1. The most recent Apache Pig benchmarks stem from 2009.
2. None of the cited literature examines how operations scale over different datasets.
3. Hadoop benchmarks were performed on clusters of 100 nodes or less (Hadoop was designed to run on clusters containing thousands of nodes, so small-scale performance analysis may not really do it justice). Naturally, the same argument can be applied against the benchmarking results presented in this article.
4. The literature fails to indicate the different communication overhead required by the various database management systems. (Again, this article does not address this concern; rather, it describes benchmark behaviour at runtime.)

2 Background: Benchmarking High-Level Query Languages

To date, several publications exist comparing the performance of Pig, HiveQL and other high-level query languages (HLQLs). In 2011, Stewart, Trinder et al.[13] compared Pig, HiveQL and JAQL using runtime metrics, according to how well each language scales, and according to how much shorter queries really are in comparison to using the Hadoop Java API directly. Using a 32-node Beowulf cluster, Stewart et al. found that:

- HiveQL scaled best (both up and out) and Java was only slightly faster (it had the best runtime performance of the three HLQLs). Java also had better scale-up performance than Pig.
- Pig is the most succinct and compact language of those compared.
- Pig and HiveQL are not Turing complete.
- Pig and JAQL scaled the same except when using joins: Pig significantly outperformed JAQL in that regard.
- Pig and Hive are optimised for skewed key distribution and outperform hand-coded Java MR jobs in that regard.

Hive's performance advantage over Pig is further supported by Apache's Hive performance benchmarks[10]. Moussa[11] from the University of Tunis applied the TPC-H benchmark to compare the Oracle SQL engine to Pig, and found that the SQL engine greatly outperformed Pig (with joins using Pig standing out as particularly slow). Again, Apache's own benchmarks[10] confirm this: when executing a join, Hadoop took 470 seconds, Hive took 471 seconds and Pig took 764 seconds (Hive took 0.2% more time than Hadoop, whilst Pig took 63% more time than Hadoop). Moussa used a dataset of 1.1GB.

While studying the performance of Pig using large astrophysical datasets, Loebman et al.[12] also found that a relational database management system outperforms Pig joins. In general, their experiments show that relational database management systems (RDBMSs) performed better than Hadoop, and that relational databases especially stood out in terms of memory management (although that was to be expected, given that NoSQL systems are designed to deal with unstructured rather than structured data). As acknowledged by the authors, it should be noted that no more than 8 nodes were used throughout the experiment. Hadoop, however, is designed to be used with hundreds if not thousands of nodes.

Work by Schätzle et al. further underpins this argument: in 2011 the authors proposed PigSPARQL (a framework for translating SPARQL queries to Pig Latin) based on the reasoning that "for scenarios, which can be characterized by first extracting information from a huge data set, second by transforming and loading the extracted data into a different format, cluster-based parallelism seems to outperform parallel databases."[4] Their reasoning is based on [5] and [6]; however, the authors of [6] acknowledge that they cannot verify the claim that Hadoop would have outperformed the parallel database systems if only it had more nodes. That is, having benchmarked Hadoop's MapReduce with 100 nodes against two parallel database systems, they found that both systems outperformed Hadoop: "First, at the scale of the experiments we conducted, both parallel database systems displayed a significant performance advantage over Hadoop MR in executing a variety of data intensive analysis benchmarks. Averaged across all five tasks at 100 nodes, DBMS-X was 3.2 times faster than MR and Vertica was 2.3 times faster than DBMS-X. While we cannot verify this claim, we believe that the systems would have the same relative performance on 1,000 nodes (the largest Teradata configuration is less than 100 nodes managing over four petabytes of data)."

3 Running the Apache Benchmark

The experiment follows in the footsteps of the Pig benchmarks (with the exception of co-grouping) published by the Apache Foundation on 11/07/07[7]. Their objective was to have baseline numbers to compare against before making major changes to the system.

3.1 Test Data

We decided to benchmark the execution of load, arithmetic, group, join and filter operations on 6 datasets (as opposed to just two):

Dataset size 1: 30,000 records (772KB)
Dataset size 2: 300,000 records (6.4MB)
Dataset size 3: 3,000,000 records (63MB)

Dataset size 4: 30 million records (628MB)
Dataset size 5: 300 million records (6.2GB)
Dataset size 6: 3 billion records (62GB)

That is, our datasets scale linearly, whereby the size of dataset n equates to 3,000 * 10^n records. A seventh dataset consisting of 1,000 records (23KB) was produced to perform join operations on. Its schema is as follows:

name - string
marks - integer
gpa - float

The data was generated using the generate_data.pl Perl script available for download on the Apache website[7], which produced tab-delimited text files with the following schema:

name - string
age - integer
gpa - float

It should be noted that the experiment differs slightly from the original in that the original used only two datasets, of 200 million records (200MB) and 10 thousand records (10KB), whereas our experiment consists of six separate datasets with a scaling factor of 10 (i.e. 30,000 records, 300,000 records, etc.).

3.2 Test Setup

The benchmarks were run on a cluster consisting of 6 nodes (1 dedicated to the NameNode and JobTracker and 5 compute nodes). Each node was equipped with two dual-core Intel(R) Xeon(R) processors and 4GB of memory. Furthermore, the cluster had Hadoop installed, configured with 1024MB of memory and 2 map + 2 reduce jobs per node. We modified the original Apache Pig scripts to include the PARALLEL keyword, forcing a total of 64 reduce tasks to be created (Pig was tested both with and without the PARALLEL keyword). Both the Hive and Pig scripts were re-run in local mode on one of the cluster nodes. Hadoop was configured to use a replication factor of 2.
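To illustrate the modification described above, the following is a minimal sketch (not one of the actual benchmark scripts; the aggregate computed, the output path and the reducer count of 8 are assumptions) of how the PARALLEL keyword attaches to a reduce-side operator in Pig Latin:

    -- Minimal sketch: PARALLEL only affects reduce-side operators
    -- (GROUP, JOIN, ORDER BY, DISTINCT); map-side work is unaffected.
    A = LOAD '/user/bj112/data/4/dataset' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float);
    B = GROUP A BY name PARALLEL 8;            -- request 8 reduce tasks
    C = FOREACH B GENERATE group, COUNT(A);    -- illustrative aggregate
    STORE C INTO '/user/bj112/output/group_sketch' USING PigStorage();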

As with the original Apache benchmark, the Linux time utility was used to measure the average wall-clock time of each operation (operations were executed 3 times each).

3.3 Test Cases

As with the original benchmark produced by Apache, we benchmarked the following operations (distinct selects and grouping were not part of the original benchmark):

1. Loading and storing of data.
2. Filtering the data so that 10% of the records are removed.
3. Filtering the data so that 90% of the records are removed.
4. Performing basic arithmetic on integers and floats (age * gpa + 3, age / gpa - 1.5).
5. Grouping the data (by name).
6. Joining the data.
7. Distinct select.
8. Grouping the data.

3.4 Results

3.4.1 Pig Benchmark Results

Having determined the optimal number of reducers (8 is optimal in our case; see section 3.4.3), the results of the Pig benchmarks run on the Hadoop cluster are as follows:

              Set 1    Set 2    Set 3    Set 4    Set 5    Set 6
Arithmetic
Filter 10%
Filter 90%
Group
Join

Table 1: Averaged performance of arithmetic, join, group, order, distinct select and filter operations on six datasets using Pig. Scripts were configured to use 8 reduce and 11 map tasks.

Figure 1: Pig runtime plotted in logarithmic scale.

3.4.2 Hive Benchmark Results

Having determined the optimal number of reducers (8 is optimal in our case; see section 3.4.3), the results of the Hive benchmarks run on the Hadoop cluster are as follows:

              Set 1    Set 2    Set 3    Set 4    Set 5    Set 6
Arithmetic
Filter 10%
Filter 90%
Group
Join
Distinct

Table 2: Averaged performance of arithmetic, join, group, distinct select and filter operations on six datasets using Hive. Scripts were configured to use 8 reduce and 11 map tasks.

3.4.3 Hive and Pig: JOIN benchmarks using a variable number of reduce tasks

The following results were obtained by varying the number of reduce tasks while using the default JOIN for both Hive and Pig. All jobs were run on the aforementioned Hadoop cluster and each job was run three times. Runtimes were then averaged as seen below. It should be noted that map time, reduce time and total time refer to the cluster's cumulative CPU time. Real time is the actual time as measured using the Unix time command (/usr/bin/time). It is this difference between the two time metrics that causes the discrepancy between the times in the tables below. That is, the CPU time required by a job running on a 10-node cluster will (more or less) be the same as the time required to run the same job on a 1000-node cluster. However, the real time it takes the job to complete on the 1000-node cluster will be 100 times less than if it were run on a 10-node cluster.

The JOIN scripts are as follows (with varied reduce tasks, of course):

Hive:

    set mapred.reduce.tasks=4;
    SELECT * FROM dataset JOIN dataset_join ON dataset.name = dataset_join.name;

Pig:

    A = LOAD '/user/bj112/data/4/dataset' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float) PARALLEL 4;
    B = LOAD '/user/bj112/data/join/dataset_join' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float) PARALLEL 4;
    C = JOIN A BY name, B BY name PARALLEL 4;
    STORE C INTO 'output' USING PigStorage() PARALLEL 4;

By default, Pig uses a hash join whilst Hive uses an equi-join.

# Reducers    # Map tasks    Avg. real time    Avg. map time    Avg. reduce time

Table 3: Pig benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

# R    % time spent on map tasks    % time spent on reduce tasks    Avg. total CPU time

Table 4: Pig benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
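As an aside, one convenient way to vary the reducer count without editing the script by hand is Pig's parameter substitution. This is only a hypothetical convenience sketch (the scripts above hard-code the value, and the file name join_benchmark.pig is illustrative), invoked for example as pig -p REDUCERS=8 join_benchmark.pig:

    -- Hypothetical parameterised variant of the join script above;
    -- $REDUCERS is supplied on the command line via -p REDUCERS=<n>.
    A = LOAD '/user/bj112/data/4/dataset' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float);
    B = LOAD '/user/bj112/data/join/dataset_join' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float);
    C = JOIN A BY name, B BY name PARALLEL $REDUCERS;
    STORE C INTO 'output' USING PigStorage();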

# R    Std. dev map time    Std. dev reduce time    Std. dev total time    Std. dev real time

Table 5: Pig benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

# Reducers    # Map tasks    Avg. real time    Avg. map time    Avg. reduce time

Table 6: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

# R    % time spent on map tasks    % time spent on reduce tasks    Avg. total CPU time
                                    52.94%
                                    58.27%
                                    74.86%
                                    60.06%
                                    70.48%
                                    74.05%
                                    78.93%
                                    84.36%

Table 7: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

Figure 2: Hive reduce time (CPU time) plotted in logarithmic scale.

Figure 3: Pig reduce time (CPU time) plotted in logarithmic scale.

# R    Std. dev map time    Std. dev reduce time    Std. dev total time    Std. dev real time

Table 8: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

The resultant join consisted of 37,987,763 lines (1.5GB). The original dataset used to perform the join consisted of 1,000 records; the dataset to which it was joined consisted of 30 million. Performing a replicated join using Pig on the same datasets resulted in a speed-up of 11% in average real time runtime over its hash-join equivalent.

By adjusting the minimum and maximum split size (mapred.min.split.size and mapred.max.split.size) and providing Hadoop with hints as to how many map tasks should be used (SET mapred.reduce.tasks=8; SET mapred.map.tasks=8), we forced Hive to use 11 map tasks (the same as Pig) and arrived at the following results (again, the JOIN was run three times and the results averaged):

              Map CPU time    Reduce CPU time    Total CPU time    Real time
Avg.:
Std. dev.:

Table 9: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB) whilst using 8 reduce and 11 map tasks.

4 Running the TPC-H Benchmark

As previously noted, the TPC-H benchmark was used to confirm the existence of a performance difference between Pig and Hive. TPC-H is a decision support benchmark published by the Transaction Processing Performance Council (TPC)[9], an organization founded to define global database benchmarks. As stated in the official TPC-H specification[14]: [TPC-H] consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance while maintaining a sufficient degree of ease of implementation. This benchmark illustrates decision support systems that

- Examine large volumes of data;
- Execute queries with a high degree of complexity;
- Give answers to critical business questions.

The performance metrics used for these benchmarks are the same as those used as part of the aforementioned Apache benchmarks:

- Real time runtime (using the Unix time command)
- Cumulative CPU time
- Map CPU time
- Reduce CPU time

In addition, 4 new metrics were added:

- Number of map tasks launched
- Number of reduce tasks launched

The TPC-H benchmarks differ from the Apache benchmarks (described earlier) in that a) they consist of more queries and b) the queries are more complex and intended to simulate a realistic business environment.

4.1 Test Data

To recap section 3, we first attempted to replicate the Apache Pig benchmark published by the Apache Foundation on 11/07/07[7]. Consequently, the data was generated using the generate_data.pl Perl script available for download on the Apache website[7]. The Perl script produced tab-delimited text files with the following schema:

name - string
age - integer
gpa - float

Six separate datasets were generated (and joined against a seventh dataset consisting of 1,000 records (23KB)) in order to measure the performance of arithmetic, group, join and filter operations. The datasets scaled linearly, the size of dataset n equating to 3,000 * 10^n records: dataset size 1 consisted of 30,000 records (772KB), dataset size 2 consisted of 300,000 records (6.4MB), dataset size 3 consisted of 3,000,000 records (63MB), dataset size 4 consisted of 30 million records (628MB), dataset size 5 consisted of 300 million records (6.2GB) and dataset size 6 consisted of 3 billion records (62GB).

One obvious downside to the above datasets is their simplicity: in reality, databases tend to be much more complex and most certainly consist of tables containing more than just three columns. Furthermore, databases usually don't consist of just one or two tables (the queries executed as part of the benchmarks from section 3 involved 2 tables at most; in fact all queries, except the join, involved only 1 table). The benchmarks produced within this report address these shortcomings by employing the much richer TPC-H datasets generated using the TPC dbgen utility. This utility produces 8 individual tables: customer.tbl consisting of 15,000,000 records (2.3GB), lineitem.tbl consisting of 600,037,902 records (75GB), nation.tbl consisting of 25 records (4KB), orders.tbl consisting of 150,000,000 records (17GB), partsupp.tbl consisting of 80,000,000 records (12GB), part.tbl consisting of 20,000,000 records (2.3GB), region.tbl consisting of 5 records (4KB) and supplier.tbl consisting of 1,000,000 records (137MB).

4.2 Test Cases

The TPC-H test cases consist of 22 distinct queries, each of which was designed to exhibit the same (or a higher) degree of complexity than is typically found in real-world scenarios, to use varying query parameters and to exercise various types of access. They are designed so that each query covers a large part of each table/dataset[14].

4.3 Test Setup

Several modifications have been made to the cluster since we ran the first set of experiments detailed in section 3, and the cluster on which the TPC-H benchmarks were run now consists of 9 hosts:

chewbacca.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3822MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
queen.doc.ic.ac.uk - Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).

awake.doc.ic.ac.uk - Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
mavolio.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 5712MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
zim.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3824MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
zorin.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3872MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
tiffanycase.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3872MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
zosimus.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3825MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
artemis.doc.ic.ac.uk - Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).

Both the Hive and Pig TPC-H scripts are available for download from the Apache website. Section 4.6 presents additions to the set of benchmarks from section 3; the datasets and scripts used there are identical to those presented in section 3.

Note: the Linux time utility was used to measure the average wall-clock time of each operation. For other metrics (CPU time, heap usage, etc.) the Hadoop logs were used.
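For orientation, the following is a minimal sketch of how a TPC-H table produced by dbgen can be declared in Pig Latin. The HDFS path is an assumption and the Apache TPC-H Pig scripts may declare the schema differently, but dbgen's output is pipe-delimited and the lineitem columns are those defined by the TPC-H specification:

    -- Illustrative only: loading the pipe-delimited TPC-H lineitem table.
    lineitem = LOAD '/tpch/lineitem' USING PigStorage('|')
        AS (l_orderkey:long, l_partkey:long, l_suppkey:long, l_linenumber:int,
            l_quantity:double, l_extendedprice:double, l_discount:double,
            l_tax:double, l_returnflag:chararray, l_linestatus:chararray,
            l_shipdate:chararray, l_commitdate:chararray, l_receiptdate:chararray,
            l_shipinstruct:chararray, l_shipmode:chararray, l_comment:chararray);
    -- e.g. the kind of grouping used heavily by the TPC-H scripts:
    gl = GROUP lineitem BY l_orderkey;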

Results

This section presents the results for both Pig and Hive.

Hive (TPC-H)

Running the TPC-H benchmarks for Hive produced the following results (note: script names were abbreviated):

Script    Avg. runtime    Std. dev.    Avg. cumulative CPU time    Avg. map tasks    Avg. reduce tasks
q1
q2
q3
q4
q5
q6
q7
q8
q9
q10
q11
q12
q13
q14
q15
q16
q17
q18
q19
q20
q21
q22

Table 10: TPC-H benchmark results for Hive using 6 trials (time is in seconds, unless indicated otherwise).

Script    Avg. map heap usage    Avg. reduce heap usage    Avg. total heap usage    Avg. map CPU time    Avg. reduce CPU time    Avg. total CPU time
q1
q2
q3
q4        N/A    N/A    N/A    N/A    N/A    N/A
q5        N/A    N/A    N/A    N/A    N/A    N/A
q6        N/A    N/A    N/A    N/A    N/A    N/A
q7
q8
q9
q10
q11
q12
q13
q14
q15       N/A    N/A    N/A    N/A    N/A    N/A
q16
q17       N/A    N/A    N/A    N/A    N/A    N/A
q18       N/A    N/A    N/A    N/A    N/A    N/A
q19
q20       N/A    N/A    N/A    N/A    N/A    N/A
q21       N/A    N/A    N/A    N/A    N/A    N/A
q22

Table 11: TPC-H benchmark results for Hive using 6 trials.

Figure 4: Real time runtimes of all 22 TPC-H benchmark scripts for Hive.

4.4 Pig (TPC-H)

Running the TPC-H benchmarks for Pig produced the following results (note: script names were abbreviated):

Script    Avg. runtime    Std. dev.
q1
q2
q3
q4
q5
q6
q7
q8
q9
q10
q11
q12
q13
q14
q15
q16
q17
q18
q19
q20
q21
q22

Table 12: TPC-H benchmark results for Pig using 6 trials (time is in seconds, unless indicated otherwise).

Script    Avg. map heap usage    Avg. reduce heap usage    Avg. total heap usage    Avg. map CPU time    Avg. reduce CPU time    Avg. total CPU time
q1
q2
q3
q4
q5
q6        N/A    N/A    N/A    N/A    N/A    N/A
q7
q8
q9
q10       N/A    N/A    N/A    N/A    N/A    N/A
q11
q12
q13
q14
q15
q16
q17
q18
q19
q20
q21
q22

Table 13: TPC-H benchmark results for Pig using 6 trials.

Figure 5: Real time runtimes of all 22 TPC-H benchmark scripts for Pig.

Hive vs Pig (TPC-H)

As shown in figure 6, Hive outperforms Pig in the majority of cases (12 to be precise). Their performance is roughly equivalent for 3 cases and Pig outperforms Hive in 6 cases. At first glance, this contradicts all results of the experiments from section 3.

Figure 6: Real time runtimes of all 22 TPC-H benchmark scripts contrasted.

Upon examining the TPC-H benchmarks more closely, two issues stood out that explain this discrepancy. The first is that after a Pig script writes its results to disk, the output files are immediately deleted using Hadoop's fs -rmr command. This process is quite costly and is measured as part of the real-time execution of the script (yet the fact that this operation is expensive in terms of runtime is not catered for). In contrast, the HiveQL scripts merely drop tables at the beginning of the script. Dropping tables is cheap, as it only involves manipulating the meta-information on the local filesystem - no interaction with the Hadoop filesystem is required. In fact, omitting the recursive delete operation reduces Pig's runtime by about 2%, whereas removing DROP TABLE in Hive does not produce any performance difference.

The aforementioned issue only accounts for a small percentage of the inequality. What causes the actual performance difference is the heavy usage of the Group By operator in all but 3 of the TPC-H test scripts. Recall from section 3 that Pig outperformed Hive in all instances except when using the Group By operator: when grouping data, Pig was 104% slower than Hive.
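To make the first of these two issues concrete, the recursive delete is an HDFS shell command issued from within the Pig script, roughly as sketched below; the path is illustrative and the exact position of the statement in the TPC-H scripts may differ:

    -- Recursive HDFS delete of a run's output, issued from a Pig script.
    -- Unlike Hive's DROP TABLE (a metadata-only operation), this touches HDFS,
    -- and its cost is included in the measured wall-clock time of the run.
    fs -rmr /user/bj112/tpch/output/q21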

Figure 7: The runtime comparison between Pig and Hive (plotted in logarithmic scale) for the Group By operator, based on the benchmarks from section 3.

For example, when running the TPC-H benchmarks for Pig, script 21 (q21_suppliers_who_kept_orders_waiting.pig) spent 41% of its real-time runtime executing the first Group By alone; in contrast, Hive required only a small amount of time for the grouping of data. The script grouped data 3 times:

    -- This Group By took up 41% of the runtime
    gl = group lineitem by l_orderkey;
    [...]
    fo = filter orders by o_orderstatus == 'F';
    [...]
    ores = order sres by numwait desc, s_name;

Consequently, the excessive use of the Group By operator skews the benchmark results significantly. Re-running the scripts and omitting the grouping of data produces the expected results. For example, running script 3 (q3_shipping_priority.pig) and omitting the Group By operator significantly reduces the runtime (to a total of 12,257,630ms cumulative CPU time).

Figure 8: The total average heap usage (in bytes) of all 22 TPC-H benchmark scripts contrasted.

Another interesting artefact is exposed by figure 8: in all instances, Hive's heap usage is significantly lower than that of Pig. This might be explained by the fact that Hive does not need to build intermediary data structures, whereas Pig (at the time of writing) does.

4.5 Determining optimal cluster configuration

Manipulating the configuration of the benchmark from section 3 in an effort to determine optimal cluster usage produced interesting results. For one, data compression is important and significantly impacts the runtime performance of JOIN and GROUP BY operations in Pig. For example, enabling compression on dataset size 4 (which contains a large amount of random data) produces a 3.2% speed-up in real time runtime. Compression in Pig can be enabled by setting the pig.tmpfilecompression flag to true and then specifying the type of compression via pig.tmpfilecompression.codec, which can be set to either gzip or lzo. Note that gzip produces better compression, whilst LZO is much faster in terms of runtime.

By editing the entry for mapred.reduce.slowstart.completed.maps in Hadoop's conf/mapred-site.xml we can tune the percentage of map tasks that must be completed before reduce tasks are created. By default, this value is set to 5%, which was found to be too low for our cluster. Balancing the ratio of mappers and reducers is critical to optimizing performance: reducers should be started early enough that data transfer is spread out over time, thus preventing network bottlenecks. On the other hand, reducers should not be started so early that they take up slots that could still be used by map tasks. Performance peaked when reduce tasks were started after 70% of map jobs had completed.

The maximum number of map and reduce tasks for a node can be specified using mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Naturally, care should be taken when configuring these: having a node with a maximum of 20 map slots but a script configured to use 30 map tasks will result in significant performance penalties, as the first 20 map tasks will run in parallel but the additional 10 will only be spawned once the first 20 have completed execution (consequently requiring one extra round of computation). The same goes for the number of reduce tasks: as illustrated by figure 9, performance peaks when a job requires just under the maximum number of reduce slots per node.
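For reference, here is a minimal sketch of how the compression and slow-start settings discussed above can be applied from within a Pig script. The values shown are the ones discussed above, whether the authors set them per script or cluster-wide is not stated, and the task-tracker slot maximums belong in the node-level mapred-site.xml rather than in a script:

    -- Illustrative per-script settings based on the values discussed above.
    SET pig.tmpfilecompression true;
    SET pig.tmpfilecompression.codec lzo;               -- or gzip for better compression
    SET mapred.reduce.slowstart.completed.maps 0.70;    -- start reducers after 70% of maps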

Figure 9: Real time runtimes contrasted with a variable number of reducers for join operations in Pig.

4.6 A small addition to section 3 - CPU runtimes

One outstanding item that our first set of results failed to report was the contrast between real time runtime and CPU runtime. As expected, cumulative CPU runtime was higher than real time runtime (since tasks are distributed between nodes).

Figure 10: Real time runtime contrasted with CPU runtime for the Pig scripts run on dataset size 5.

5 Conclusion

Of specific interest was the finding that Pig consistently outperformed Hive (with the exception of grouping data). Specifically:

- For arithmetic operations, Pig is 46% faster (on average) than Hive.
- For filtering 10% of the data, Pig is 49% faster (on average) than Hive.
- For filtering 90% of the data, Pig is 18% faster (on average) than Hive.
- For joining datasets, Pig is 36% faster (on average) than Hive.

This conflicted with existing literature that found Hive to outperform Pig: in 2009, Apache's own performance benchmarks found that Pig was significantly slower than Hive, and these findings were validated in 2011 by Stewart and Trinder et al., who also found that Hive map-reduce jobs outperformed those produced by the Pig compiler. When forced to equal terms (that is, when forcing Hive to use the same number of mappers as Pig), Hive remains 67% slower than Pig when comparing real time runtime (i.e. it takes Pig roughly 1/3 of the time to compute the JOIN); increasing the number of map tasks in Hive from 4 to 11 only resulted in a 13% speed-up. It should also be noted that the performance difference between Pig and Hive does not scale linearly: initially there is little difference in performance (due to the large start-up costs), but as the datasets increase in size, Hive becomes consistently slower (to the point of crashing when attempting to join large datasets).

To conclude, the discussed experiments allowed for the answering of 4 core questions:

1. How do Pig and Hive perform as other Hadoop properties are varied (e.g. the number of map tasks)? Balancing the ratio of mappers and reducers has a big impact on real time runtime and consequently is critical to optimizing performance: reducers should be started early enough that data transfer is spread out sufficiently to prevent network congestion. On the other hand, reducers should not be started so early that they take up slots that could still be used by map tasks. Care should also be taken when setting the maximum allowable map and reduce slots per node. For example, having a node with a maximum of 20 map slots but a script configured to use 30 map tasks will result in significant performance penalties, because the first 20 map tasks will run in parallel but the additional 10 will only be spawned once the first 20 have completed execution (consequently requiring one extra round of computation). The same goes for the number of reduce tasks: as illustrated by figure 9, performance peaks when a job requires a number of reduce slots per node that falls just below the maximum.

2. Do more complex datasets and queries (e.g. the TPC-H benchmarks) yield the same results as the Apache benchmarks from 11/07/07? At first glance, running the TPC-H benchmarks contradicts the Apache benchmark results: in nearly all instances, Hive outperforms Pig. However, closer examination revealed that nearly all TPC-H scripts rely heavily on the Group By operator, an operator which appears to be poorly implemented in Pig. Using the Group By operator greatly degrades the performance of Pig Latin scripts. The TPC-H benchmark results might therefore be less relevant to your decision process if grouping will not be a dominant feature of your application. (Operators are not evenly distributed throughout the scripts: if one operator is poorly implemented, then this will skew the entire result set, as can be seen in section 4.4.)

3. How does real time runtime scale with regard to CPU runtime? As expected given the cluster configuration (9 nodes), the real time runtime was between 15% and 20% of the cumulative CPU runtime.

4. What should the ratio of map and reduce tasks be? The ratio of map to reduce tasks can be configured through mapred.reduce.slowstart.completed.maps in Hadoop's conf/mapred-site.xml. The default value of 0.05 (i.e. 5%) was found to be too low; the optimum for our cluster was about 70%.

It should also be noted that the use of the Group By operator within the TPC-H benchmarks skews results significantly (recall the Apache benchmarks, which showed that Pig outperformed Hive in all instances except when using the Group By operator: when grouping data, Pig was 104% slower than Hive). Re-running the scripts and omitting the grouping of data produces the expected results. For example, running script 3 (q3_shipping_priority.pig) and omitting the Group By operator significantly reduces the runtime (to a total of 12,257,630ms cumulative CPU time).

As already noted in the introduction, Hadoop was designed to run on clusters containing hundreds or thousands of nodes; therefore, running small-scale performance analysis may not really do it justice. Ideally, the benchmarks presented in this article should be run on much larger clusters.

References

[1] Stewart, R. J.; Trinder, P.; Loidl, H. (2011), Comparing High Level MapReduce Query Languages. Springer Berlin Heidelberg, Advanced Parallel Processing Technologies.
[2] Moussa, R. (2012), TPC-H Benchmarking of Pig Latin on a Hadoop Cluster. Communications and Information Technology (ICCIT), 2012 International Conference.
[3] Loebman, S.; Nunley, D.; Kwon, Y.; Howe, B.; Balazinska, M.; Gardner, J. P., Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In Proc. of CLUSTER 2009.
[4] Schätzle, A.; Przyjaciel-Zablocki, M.; Hornung, T.; Lausen, G. (2011), PigSPARQL: Übersetzung von SPARQL nach Pig Latin. Proc. BTW.
[5] Lin, J.; Dyer, C. (2010), Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies.
[6] Pavlo, A.; Paulson, E.; Rasin, A.; Abadi, D. J.; DeWitt, D. J.; Madden, S.; Stonebraker, M. (2009), A comparison of approaches to large-scale data analysis. In Proc. SIGMOD, ACM.
[7] DBPedias (2013), Pig Performance Benchmarks. Visited 15/01/2013.
[8] Gates, A. F.; Natkovich, O.; Chopra, S.; Kamath, P.; Narayanamurthy, S. M.; Olston, C.; Reed, B.; Srinivasan, S.; Srivastava, U. (2009), Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience. Proc. VLDB Endow.
[9] Transaction Processing Council, Transaction Processing Council Website. Visited 18/06/2013.
[10] Apache Software Foundation (2009), Hive, PIG, Hadoop benchmark results. Visited 03/01/2013.
[11] Moussa, R. (2012), TPC-H Benchmarking of Pig Latin on a Hadoop Cluster. Communications and Information Technology (ICCIT), 2012 International Conference.
[12] Loebman, S.; Nunley, D.; Kwon, Y.; Howe, B.; Balazinska, M.; Gardner, J. P., Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In Proc. of CLUSTER 2009.
[13] Stewart, R. J.; Trinder, P.; Loidl, H. (2011), Comparing High Level MapReduce Query Languages. Springer Berlin Heidelberg, Advanced Parallel Processing Technologies.
[14] Transaction Processing Performance Council (TPC) (2013), TPC Benchmark H, Standard Specification. Transaction Processing Performance Council (TPC), Presidio of San Francisco.


More information

ITG Software Engineering

ITG Software Engineering Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

Hadoop Scripting with Jaql & Pig

Hadoop Scripting with Jaql & Pig Hadoop Scripting with Jaql & Pig Konstantin Haase und Johan Uhle 1 Outline Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 2 Introduction Goal: Compare two high level scripting languages

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System

How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

JackHare: a framework for SQL to NoSQL translation using MapReduce

JackHare: a framework for SQL to NoSQL translation using MapReduce DOI 10.1007/s10515-013-0135-x JackHare: a framework for SQL to NoSQL translation using MapReduce Wu-Chun Chung Hung-Pin Lin Shih-Chang Chen Mon-Fong Jiang Yeh-Ching Chung Received: 15 December 2012 / Accepted:

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop : Flexible Data Placement and Its Exploitation in Hadoop 1 Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan Rainer Gemulla +, Aljoscha Krettek #, John McPherson IBM Almaden Research Center, USA {myeltaba,

More information

Hadoop and MySQL for Big Data

Hadoop and MySQL for Big Data Hadoop and MySQL for Big Data Alexander Rubin October 9, 2013 About Me Alexander Rubin, Principal Consultant, Percona Working with MySQL for over 10 years Started at MySQL AB, Sun Microsystems, Oracle

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

SOLUTION BRIEF: SLCM R12.7 PERFORMANCE TEST RESULTS JANUARY, 2012. Load Test Results for Submit and Approval Phases of Request Life Cycle

SOLUTION BRIEF: SLCM R12.7 PERFORMANCE TEST RESULTS JANUARY, 2012. Load Test Results for Submit and Approval Phases of Request Life Cycle SOLUTION BRIEF: SLCM R12.7 PERFORMANCE TEST RESULTS JANUARY, 2012 Load Test Results for Submit and Approval Phases of Request Life Cycle Table of Contents Executive Summary 3 Test Environment 4 Server

More information

Performance Analysis of Cloud Relational Database Services

Performance Analysis of Cloud Relational Database Services Performance Analysis of Cloud Relational Database Services Jialin Li lijl@cs.washington.edu Naveen Kr. Sharma naveenks@cs.washington.edu June 7, 203 Adriana Szekeres aaazs@cs.washington.edu Introduction

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information