Pig vs Hive: Benchmarking High Level Query Languages


Benjamin Jakobus, IBM, Ireland
Dr. Peter McBrien, Imperial College London, UK

Abstract

This article presents benchmarking results (which stem from 2013, when the author spent a year at Imperial College London) of two benchmark sets, run on small clusters of 6 and 9 nodes, applied to Hive and Pig running on Hadoop. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline against which to compare major Pig Latin releases). The second set was obtained by applying the TPC-H benchmarks. The two benchmarks showed conflicting results: the first indicated that Pig outperformed Hive on most operations, yet, interestingly, the TPC-H results provide evidence that Hive is significantly faster than Pig. The article analyzes the two benchmarks and concludes with a set of differences and a justification of the results. It presumes that the reader has a basic knowledge of Hadoop and big data; it is not intended as an introduction to Hadoop, Pig or Hive.

About the authors

Benjamin Jakobus graduated with a BSc in Computer Science from University College Cork in 2011, after which he co-founded an Irish start-up. He returned to university one year later and graduated with an MSc in Advanced Computing from Imperial College London. Since graduating, he has taken up a position as a Software Engineer at IBM Dublin (SWG, Collaboration Solutions). This article is based on his Masters thesis, developed under the supervision of Dr. Peter McBrien.

Dr. Peter McBrien graduated with a BA in Computer Science from Cambridge University. After some time working at Racal and ICL, he joined the Department of Computing at Imperial College as an RA in 1989, working on the Tempora Esprit Project. He obtained his PhD, Implementing Graph Rewriting By Graph Rewriting, in 1992, under the supervision of Chris Hankin. In 1994, he joined the Department of Computing at King's College London as a lecturer, and returned to the Department of Computing at Imperial College in August 1999 as a lecturer. Since then he has been promoted to Senior Lecturer.

Acknowledgements

The authors would like to thank Yu Liu, PhD student at Imperial College London, who, over the course of the past year, helped us with any technical problems that we encountered.

1 Introduction

Despite Hadoop's popularity, users find it cumbersome to develop Map-Reduce (MR) programs directly. To simplify the task, high-level scripting languages such as Pig Latin and HiveQL have emerged. Users are therefore often faced with the question of whether to use Pig or Hive. At the time of writing, no up-to-date scientific studies exist to help them answer this question. In addition, the performance differences between Pig and Hive are not well understood, and literature examining these differences is scarce.

This article presents benchmarking results (which stem from 2013, whilst the author spent a year at Imperial College London) of two benchmark sets, run on small clusters of 6 and 9 nodes, applied to Hive and Pig running on Hadoop. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07. The second set was obtained by applying the TPC-H benchmarks. The TPC-H test cases consist of 22 distinct queries, each of which exhibits the same (or a higher) degree of complexity than is typically found in real-world industry scenarios, uses varying query parameters and exercises various types of access.

Whilst existing literature[7][10][6][4][11][13][8][12] addresses some of these questions, it suffers from the following shortcomings:

1. The most recent Apache Pig benchmarks stem from 2009.
2. None of the cited literature examines how operations scale over different datasets.
3. Hadoop benchmarks were performed on clusters of 100 nodes or less (Hadoop was designed to run on clusters containing thousands of nodes, so small-scale performance analysis may not really do it justice). Naturally, the same argument can be applied against the benchmarking results presented in this article.
4. The literature fails to indicate the different communication overhead required by the various database management systems. (Again, this article does not address this concern; rather, it describes benchmark behaviour at runtime.)

2 Background: Benchmarking High-Level Query Languages

To date, several publications exist comparing the performance of Pig, HiveQL and other high-level query languages (HLQLs). In 2011, Stewart, Trinder et al.[13] compared Pig, HiveQL and JAQL using runtime metrics, according to how well each language scales, and according to how much shorter queries really are in comparison to using the Hadoop Java API directly. Using a 32-node Beowulf cluster, Stewart et al. found that:

- HiveQL scaled best (both up and out) and Java was only slightly faster (it had the best runtime performance of the three HLQLs). Java also had better scale-up performance than Pig.
- Pig is the most succinct and compact language of those compared.
- Pig and HiveQL are not Turing complete.
- Pig and JAQL scaled the same except when using joins: Pig significantly outperformed JAQL in that regard.
- Pig and Hive are optimised for skewed key distribution and outperform hand-coded Java MR jobs in that regard.

Hive's performance advantage over Pig is further supported by Apache's Hive performance benchmarks[10]. Moussa[11] from the University of Tunis applied the TPC-H benchmark to compare the Oracle SQL engine to Pig, and found that the SQL engine greatly outperformed Pig (with joins using Pig standing out as particularly slow). Again, Apache's own benchmarks[10] confirm this: when executing a join, Hadoop took 470 seconds, Hive took 471 seconds and Pig took 764 seconds (Hive took 0.2% more time than Hadoop, whilst Pig took 63% more time than Hadoop). Moussa used a dataset of 1.1GB.

While studying the performance of Pig using large astrophysical datasets, Loebman et al.[12] also found that a relational database management system outperforms Pig joins. In general, their experiments show that relational database management systems (RDBMSs) performed better than Hadoop, and that relational databases especially stood out in terms of memory management (although that was to be expected, given that NoSQL systems are designed to deal with unstructured rather than structured data). As acknowledged by the authors, it should be noted that no more than 8 nodes were used throughout the experiment. Hadoop, however, is designed to be used with hundreds if not thousands of nodes.

Work by Schätzle et al. further underpins this argument: in 2011 the authors proposed PigSPARQL (a framework for translating SPARQL queries to Pig Latin) based on the reasoning that "for scenarios, which can be characterized by first extracting information from a huge data set, second by transforming and loading the extracted data into a different format, cluster-based parallelism seems to outperform parallel databases."[4] Their reasoning is based on [5] and [6]; however, the authors of [6] acknowledge that they cannot verify the claim that Hadoop would have outperformed the parallel database systems if only it had more nodes. That is, having benchmarked Hadoop's MapReduce with 100 nodes against two parallel database systems, they found that both systems outperformed Hadoop: "First, at the scale of the experiments we conducted, both parallel database systems displayed a significant performance advantage over Hadoop MR in executing a variety of data intensive analysis benchmarks. Averaged across all five tasks at 100 nodes, DBMS-X was 3.2 times faster than MR and Vertica was 2.3 times faster than DBMS-X. While we cannot verify this claim, we believe that the systems would have the same relative performance on 1,000 nodes (the largest Teradata configuration is less than 100 nodes managing over four petabytes of data)."

3 Running the Apache Benchmark

The experiment follows in the footsteps of the Pig benchmarks (with the exception of co-grouping) published by the Apache Foundation on 11/07/07[7]. Their objective was to have baseline numbers to compare against before making major changes to the system.

3.1 Test Data

We decided to benchmark the execution of load, arithmetic, group, join and filter operations on 6 datasets (as opposed to just two):

Dataset size 1: 30,000 records (772KB)
Dataset size 2: 300,000 records (6.4MB)
Dataset size 3: 3,000,000 records (63MB)

Dataset size 4: 30 million records (628MB)
Dataset size 5: 300 million records (6.2GB)
Dataset size 6: 3 billion records (62GB)

That is, our datasets scale linearly, whereby the size of dataset n equates to 3,000 * 10^n records. A seventh dataset consisting of 1,000 records (23KB) was produced to perform join operations on. Its schema is as follows:

name - string
marks - integer
gpa - float

The data was generated using the generate_data.pl Perl script available for download on the Apache website[7], which produced tab-delimited text files with the following schema:

name - string
age - integer
gpa - float

It should be noted that the experiment differs slightly from the original in that the original used only two datasets, of 200 million records (200MB) and 10 thousand records (10KB), whereas our experiment consists of six separate datasets with a scaling factor of 10 (i.e. 30,000 records, 300,000 records, etc.).

3.2 Test Setup

The benchmarks were run on a cluster consisting of 6 nodes (1 dedicated to the NameNode and JobTracker and 5 compute nodes). Each node was equipped with two dual-core Intel(R) Xeon(R) processors and 4GB of memory. Furthermore, the cluster had Hadoop installed, configured with 1024MB of memory and 2 map + 2 reduce jobs per node. We modified the original Apache Pig scripts to include the PARALLEL keyword, forcing a total of 64 reduce tasks to be created (Pig was tested both with and without the PARALLEL keyword). Both the Hive and Pig scripts were re-run in local mode on one of the cluster nodes. Hadoop was configured to use a replication factor of 2.
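To illustrate the modification described above, the following is a minimal sketch (not one of the actual benchmark scripts; the aggregate computed, the output path and the reducer count of 8 are assumptions) of how the PARALLEL keyword attaches to a reduce-side operator in Pig Latin:

    -- Minimal sketch: PARALLEL only affects reduce-side operators
    -- (GROUP, JOIN, ORDER BY, DISTINCT); map-side work is unaffected.
    A = LOAD '/user/bj112/data/4/dataset' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float);
    B = GROUP A BY name PARALLEL 8;            -- request 8 reduce tasks
    C = FOREACH B GENERATE group, COUNT(A);    -- illustrative aggregate
    STORE C INTO '/user/bj112/output/group_sketch' USING PigStorage();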

As with the original Apache benchmark, the Linux time utility was used to measure the average wall-clock time of each operation (operations were executed 3 times each).

3.3 Test Cases

As with the original benchmark produced by Apache, we benchmarked the following operations (distinct selects and grouping were not part of the original benchmark):

1. Loading and storing of data.
2. Filtering the data so that 10% of the records are removed.
3. Filtering the data so that 90% of the records are removed.
4. Performing basic arithmetic on integers and floats (age * gpa + 3, age / gpa - 1.5).
5. Grouping the data (by name).
6. Joining the data.
7. Distinct select.
8. Grouping the data.

3.4 Results

3.4.1 Pig Benchmark Results

Having determined the optimal number of reducers (8 is optimal in our case; see section 3.4.3), the results of the Pig benchmarks run on the Hadoop cluster are as follows:

              Set 1    Set 2    Set 3    Set 4    Set 5    Set 6
Arithmetic
Filter 10%
Filter 90%
Group
Join

Table 1: Averaged performance of arithmetic, join, group, order, distinct select and filter operations on six datasets using Pig. Scripts were configured to use 8 reduce and 11 map tasks.

Figure 1: Pig runtime plotted in logarithmic scale.

3.4.2 Hive Benchmark Results

Having determined the optimal number of reducers (8 is optimal in our case; see section 3.4.3), the results of the Hive benchmarks run on the Hadoop cluster are as follows:

              Set 1    Set 2    Set 3    Set 4    Set 5    Set 6
Arithmetic
Filter 10%
Filter 90%
Group
Join
Distinct

Table 2: Averaged performance of arithmetic, join, group, distinct select and filter operations on six datasets using Hive. Scripts were configured to use 8 reduce and 11 map tasks.

3.4.3 Hive and Pig: JOIN benchmarks using a variable number of reduce tasks

The following results were obtained by varying the number of reduce tasks while using the default JOIN for both Hive and Pig. All jobs were run on the aforementioned Hadoop cluster and each job was run three times. Runtimes were then averaged as seen below. It should be noted that map time, reduce time and total time refer to the cluster's cumulative CPU time. Real time is the actual time as measured using the Unix time command (/usr/bin/time). It is this difference between the two time metrics that causes the discrepancy between the times in the tables below. That is, the CPU time required by a job running on a 10-node cluster will (more or less) be the same as the time required to run the same job on a 1000-node cluster. However, the real time it takes the job to complete on the 1000-node cluster will be 100 times less than if it were run on a 10-node cluster.

The JOIN scripts are as follows (with varied reduce tasks, of course):

Hive:

    set mapred.reduce.tasks=4;
    SELECT * FROM dataset JOIN dataset_join ON dataset.name = dataset_join.name;

Pig:

    A = LOAD '/user/bj112/data/4/dataset' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float) PARALLEL 4;
    B = LOAD '/user/bj112/data/join/dataset_join' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float) PARALLEL 4;
    C = JOIN A BY name, B BY name PARALLEL 4;
    STORE C INTO 'output' USING PigStorage() PARALLEL 4;

By default, Pig uses a hash join whilst Hive uses an equi-join.

# Reducers    # Map tasks    Avg. real time    Avg. map time    Avg. reduce time

Table 3: Pig benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

# R    % time spent on map tasks    % time spent on reduce tasks    Avg. total CPU time

Table 4: Pig benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
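As an aside, one convenient way to vary the reducer count without editing the script by hand is Pig's parameter substitution. This is only a hypothetical convenience sketch (the scripts above hard-code the value, and the file name join_benchmark.pig is illustrative), invoked for example as pig -p REDUCERS=8 join_benchmark.pig:

    -- Hypothetical parameterised variant of the join script above;
    -- $REDUCERS is supplied on the command line via -p REDUCERS=<n>.
    A = LOAD '/user/bj112/data/4/dataset' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float);
    B = LOAD '/user/bj112/data/join/dataset_join' USING PigStorage('\t')
            AS (name:chararray, age:int, gpa:float);
    C = JOIN A BY name, B BY name PARALLEL $REDUCERS;
    STORE C INTO 'output' USING PigStorage();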

# R    Std. dev map time    Std. dev reduce time    Std. dev total time    Std. dev real time

Table 5: Pig benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

# Reducers    # Map tasks    Avg. real time    Avg. map time    Avg. reduce time

Table 6: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

# R    % time spent on map tasks    % time spent on reduce tasks    Avg. total CPU time
                                    52.94%
                                    58.27%
                                    74.86%
                                    60.06%
                                    70.48%
                                    74.05%
                                    78.93%
                                    84.36%

Table 7: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

Figure 2: Hive reduce time (CPU time) plotted in logarithmic scale.

Figure 3: Pig reduce time (CPU time) plotted in logarithmic scale.

# R    Std. dev map time    Std. dev reduce time    Std. dev total time    Std. dev real time

Table 8: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.

The resultant join consisted of 37,987,763 lines (1.5GB). The original dataset used to perform the join consisted of 1,000 records; the dataset to which it was joined consisted of 30 million. Performing a replicated join using Pig on the same datasets resulted in a speed-up of 11% in average real time runtime over its hash-join equivalent.

By adjusting the minimum and maximum split size (mapred.min.split.size and mapred.max.split.size) and providing Hadoop with hints as to how many map tasks should be used (SET mapred.reduce.tasks=8; SET mapred.map.tasks=8), we forced Hive to use 11 map tasks (the same as Pig) and arrived at the following results (again, the JOIN was run three times and the results averaged):

              Map CPU time    Reduce CPU time    Total CPU time    Real time
Avg.:
Std. dev.:

Table 9: Hive benchmarks for performing the default JOIN operation on dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB) whilst using 8 reduce and 11 map tasks.

4 Running the TPC-H Benchmark

As previously noted, the TPC-H benchmark was used to confirm the existence of a performance difference between Pig and Hive. TPC-H is a decision support benchmark published by the Transaction Processing Performance Council (TPC)[9], an organization founded to define global database benchmarks. As stated in the official TPC-H specification[14]: [TPC-H] consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance while maintaining a sufficient degree of ease of implementation. This benchmark illustrates decision support systems that

- Examine large volumes of data;
- Execute queries with a high degree of complexity;
- Give answers to critical business questions.

The performance metrics used for these benchmarks are the same as those used as part of the aforementioned Apache benchmarks:

- Real time runtime (using the Unix time command)
- Cumulative CPU time
- Map CPU time
- Reduce CPU time

In addition, 4 new metrics were added:

- Number of map tasks launched
- Number of reduce tasks launched

The TPC-H benchmarks differ from the Apache benchmarks (described earlier) in that a) they consist of more queries and b) the queries are more complex and intended to simulate a realistic business environment.

4.1 Test Data

To recap section 3, we first attempted to replicate the Apache Pig benchmark published by the Apache Foundation on 11/07/07[7]. Consequently, the data was generated using the generate_data.pl Perl script available for download on the Apache website[7]. The Perl script produced tab-delimited text files with the following schema:

name - string
age - integer
gpa - float

Six separate datasets were generated (and joined against a seventh dataset consisting of 1,000 records (23KB)) in order to measure the performance of arithmetic, group, join and filter operations. The datasets scaled linearly, the size of dataset n equating to 3,000 * 10^n records: dataset size 1 consisted of 30,000 records (772KB), dataset size 2 consisted of 300,000 records (6.4MB), dataset size 3 consisted of 3,000,000 records (63MB), dataset size 4 consisted of 30 million records (628MB), dataset size 5 consisted of 300 million records (6.2GB) and dataset size 6 consisted of 3 billion records (62GB).

One obvious downside to the above datasets is their simplicity: in reality, databases tend to be much more complex and most certainly consist of tables containing more than just three columns. Furthermore, databases usually don't consist of just one or two tables (the queries executed as part of the benchmarks from section 3 involved 2 tables at most; in fact all queries, except the join, involved only 1 table). The benchmarks produced within this report address these shortcomings by employing the much richer TPC-H datasets generated using the TPC dbgen utility. This utility produces 8 individual tables: customer.tbl consisting of 15,000,000 records (2.3GB), lineitem.tbl consisting of 600,037,902 records (75GB), nation.tbl consisting of 25 records (4KB), orders.tbl consisting of 150,000,000 records (17GB), partsupp.tbl consisting of 80,000,000 records (12GB), part.tbl consisting of 20,000,000 records (2.3GB), region.tbl consisting of 5 records (4KB) and supplier.tbl consisting of 1,000,000 records (137MB).

4.2 Test Cases

The TPC-H test cases consist of 22 distinct queries, each of which was designed to exhibit the same (or a higher) degree of complexity than is typically found in real-world scenarios, to use varying query parameters and to exercise various types of access. They are designed so that each query covers a large part of each table/dataset[14].

4.3 Test Setup

Several modifications have been made to the cluster since we ran the first set of experiments detailed in section 3, and the cluster on which the TPC-H benchmarks were run now consists of 9 hosts:

chewbacca.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3822MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
queen.doc.ic.ac.uk - Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).

awake.doc.ic.ac.uk - Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
mavolio.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 5712MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
zim.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3824MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
zorin.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3872MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
tiffanycase.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3872MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
zosimus.doc.ic.ac.uk - Intel(R) Core(TM)2 Duo CPU, 3825MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).
artemis.doc.ic.ac.uk - Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04 LTS (Precise Pangolin).

Both the Hive and Pig TPC-H scripts are available for download from the Apache website. Section 4.6 presents additions to the set of benchmarks from section 3; the datasets and scripts used there are identical to those presented in section 3.

Note: the Linux time utility was used to measure the average wall-clock time of each operation. For other metrics (CPU time, heap usage, etc.) the Hadoop logs were used.
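For orientation, the following is a minimal sketch of how a TPC-H table produced by dbgen can be declared in Pig Latin. The HDFS path is an assumption and the Apache TPC-H Pig scripts may declare the schema differently, but dbgen's output is pipe-delimited and the lineitem columns are those defined by the TPC-H specification:

    -- Illustrative only: loading the pipe-delimited TPC-H lineitem table.
    lineitem = LOAD '/tpch/lineitem' USING PigStorage('|')
        AS (l_orderkey:long, l_partkey:long, l_suppkey:long, l_linenumber:int,
            l_quantity:double, l_extendedprice:double, l_discount:double,
            l_tax:double, l_returnflag:chararray, l_linestatus:chararray,
            l_shipdate:chararray, l_commitdate:chararray, l_receiptdate:chararray,
            l_shipinstruct:chararray, l_shipmode:chararray, l_comment:chararray);
    -- e.g. the kind of grouping used heavily by the TPC-H scripts:
    gl = GROUP lineitem BY l_orderkey;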

Results

This section presents the results for both Pig and Hive.

Hive (TPC-H)

Running the TPC-H benchmarks for Hive produced the following results (note: script names were abbreviated):

Script    Avg. runtime    Std. dev.    Avg. cumulative CPU time    Avg. map tasks    Avg. reduce tasks
q1
q2
q3
q4
q5
q6
q7
q8
q9
q10
q11
q12
q13
q14
q15
q16
q17
q18
q19
q20
q21
q22

Table 10: TPC-H benchmark results for Hive using 6 trials (time is in seconds, unless indicated otherwise).

Script    Avg. map heap usage    Avg. reduce heap usage    Avg. total heap usage    Avg. map CPU time    Avg. reduce CPU time    Avg. total CPU time
q1
q2
q3
q4        N/A    N/A    N/A    N/A    N/A    N/A
q5        N/A    N/A    N/A    N/A    N/A    N/A
q6        N/A    N/A    N/A    N/A    N/A    N/A
q7
q8
q9
q10
q11
q12
q13
q14
q15       N/A    N/A    N/A    N/A    N/A    N/A
q16
q17       N/A    N/A    N/A    N/A    N/A    N/A
q18       N/A    N/A    N/A    N/A    N/A    N/A
q19
q20       N/A    N/A    N/A    N/A    N/A    N/A
q21       N/A    N/A    N/A    N/A    N/A    N/A
q22

Table 11: TPC-H benchmark results for Hive using 6 trials.

Figure 4: Real time runtimes of all 22 TPC-H benchmark scripts for Hive.

4.4 Pig (TPC-H)

Running the TPC-H benchmarks for Pig produced the following results (note: script names were abbreviated):

Script    Avg. runtime    Std. dev.
q1
q2
q3
q4
q5
q6
q7
q8
q9
q10
q11
q12
q13
q14
q15
q16
q17
q18
q19
q20
q21
q22

Table 12: TPC-H benchmark results for Pig using 6 trials (time is in seconds, unless indicated otherwise).

Script    Avg. map heap usage    Avg. reduce heap usage    Avg. total heap usage    Avg. map CPU time    Avg. reduce CPU time    Avg. total CPU time
q1
q2
q3
q4
q5
q6        N/A    N/A    N/A    N/A    N/A    N/A
q7
q8
q9
q10       N/A    N/A    N/A    N/A    N/A    N/A
q11
q12
q13
q14
q15
q16
q17
q18
q19
q20
q21
q22

Table 13: TPC-H benchmark results for Pig using 6 trials.

Figure 5: Real time runtimes of all 22 TPC-H benchmark scripts for Pig.

Hive vs Pig (TPC-H)

As shown in figure 6, Hive outperforms Pig in the majority of cases (12 to be precise). Their performance is roughly equivalent for 3 cases and Pig outperforms Hive in 6 cases. At first glance, this contradicts all results of the experiments from section 3.

Figure 6: Real time runtimes of all 22 TPC-H benchmark scripts contrasted.

Upon examining the TPC-H benchmarks more closely, two issues stood out that explain this discrepancy. The first is that after a Pig script writes its results to disk, the output files are immediately deleted using Hadoop's fs -rmr command. This process is quite costly and is measured as part of the real-time execution of the script (yet the fact that this operation is expensive in terms of runtime is not catered for). In contrast, the HiveQL scripts merely drop tables at the beginning of the script. Dropping tables is cheap, as it only involves manipulating the meta-information on the local filesystem - no interaction with the Hadoop filesystem is required. In fact, omitting the recursive delete operation reduces Pig's runtime by about 2%, whereas removing DROP TABLE in Hive does not produce any performance difference.

The aforementioned issue only accounts for a small percentage of the inequality. What causes the actual performance difference is the heavy usage of the Group By operator in all but 3 of the TPC-H test scripts. Recall from section 3 that Pig outperformed Hive in all instances except when using the Group By operator: when grouping data, Pig was 104% slower than Hive.
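To make the first of these two issues concrete, the recursive delete is an HDFS shell command issued from within the Pig script, roughly as sketched below; the path is illustrative and the exact position of the statement in the TPC-H scripts may differ:

    -- Recursive HDFS delete of a run's output, issued from a Pig script.
    -- Unlike Hive's DROP TABLE (a metadata-only operation), this touches HDFS,
    -- and its cost is included in the measured wall-clock time of the run.
    fs -rmr /user/bj112/tpch/output/q21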

Figure 7: The runtime comparison between Pig and Hive (plotted in logarithmic scale) for the Group By operator, based on the benchmarks from section 3.

For example, when running the TPC-H benchmarks for Pig, script 21 (q21_suppliers_who_kept_orders_waiting.pig) spent 41% of its real-time runtime executing the first Group By alone; in contrast, Hive required only a small amount of time for the grouping of data. The script grouped data 3 times:

    -- This Group By took up 41% of the runtime
    gl = group lineitem by l_orderkey;
    [...]
    fo = filter orders by o_orderstatus == 'F';
    [...]
    ores = order sres by numwait desc, s_name;

Consequently, the excessive use of the Group By operator skews the benchmark results significantly. Re-running the scripts and omitting the grouping of data produces the expected results. For example, running script 3 (q3_shipping_priority.pig) and omitting the Group By operator significantly reduces the runtime (to a total of 12,257,630ms cumulative CPU time).

Figure 8: The total average heap usage (in bytes) of all 22 TPC-H benchmark scripts contrasted.

Another interesting artefact is exposed by figure 8: in all instances, Hive's heap usage is significantly lower than that of Pig. This might be explained by the fact that Hive does not need to build intermediary data structures, whereas Pig (at the time of writing) does.

4.5 Determining optimal cluster configuration

Manipulating the configuration of the benchmark from section 3 in an effort to determine optimal cluster usage produced interesting results. For one, data compression is important and significantly impacts the runtime performance of JOIN and GROUP BY operations in Pig. For example, enabling compression on dataset size 4 (which contains a large amount of random data) produces a 3.2% speed-up in real time runtime. Compression in Pig can be enabled by setting the pig.tmpfilecompression flag to true and then specifying the type of compression via pig.tmpfilecompression.codec, which can be set to either gzip or lzo. Note that gzip produces better compression, whilst LZO is much faster in terms of runtime.

By editing the entry for mapred.reduce.slowstart.completed.maps in Hadoop's conf/mapred-site.xml we can tune the percentage of map tasks that must be completed before reduce tasks are created. By default, this value is set to 5%, which was found to be too low for our cluster. Balancing the ratio of mappers and reducers is critical to optimizing performance: reducers should be started early enough that data transfer is spread out over time, thus preventing network bottlenecks. On the other hand, reducers should not be started so early that they take up slots that could still be used by map tasks. Performance peaked when reduce tasks were started after 70% of map jobs had completed.

The maximum number of map and reduce tasks for a node can be specified using mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Naturally, care should be taken when configuring these: having a node with a maximum of 20 map slots but a script configured to use 30 map tasks will result in significant performance penalties, as the first 20 map tasks will run in parallel but the additional 10 will only be spawned once the first 20 have completed execution (consequently requiring one extra round of computation). The same goes for the number of reduce tasks: as illustrated by figure 9, performance peaks when a job requires just under the maximum number of reduce slots per node.
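For reference, here is a minimal sketch of how the compression and slow-start settings discussed above can be applied from within a Pig script. The values shown are the ones discussed above, whether the authors set them per script or cluster-wide is not stated, and the task-tracker slot maximums belong in the node-level mapred-site.xml rather than in a script:

    -- Illustrative per-script settings based on the values discussed above.
    SET pig.tmpfilecompression true;
    SET pig.tmpfilecompression.codec lzo;               -- or gzip for better compression
    SET mapred.reduce.slowstart.completed.maps 0.70;    -- start reducers after 70% of maps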

Figure 9: Real time runtimes contrasted with a variable number of reducers for join operations in Pig.

4.6 A small addition to section 3 - CPU runtimes

One outstanding item that our first set of results failed to report was the contrast between real time runtime and CPU runtime. As expected, cumulative CPU runtime was higher than real time runtime (since tasks are distributed between nodes).

Figure 10: Real time runtime contrasted with CPU runtime for the Pig scripts run on dataset size 5.

5 Conclusion

Of specific interest was the finding that Pig consistently outperformed Hive (with the exception of grouping data). Specifically:

- For arithmetic operations, Pig is 46% faster (on average) than Hive.
- For filtering 10% of the data, Pig is 49% faster (on average) than Hive.
- For filtering 90% of the data, Pig is 18% faster (on average) than Hive.
- For joining datasets, Pig is 36% faster (on average) than Hive.

This conflicted with existing literature that found Hive to outperform Pig: in 2009, Apache's own performance benchmarks found that Pig was significantly slower than Hive, and these findings were validated in 2011 by Stewart and Trinder et al., who also found that Hive map-reduce jobs outperformed those produced by the Pig compiler. When forced to equal terms (that is, when forcing Hive to use the same number of mappers as Pig), Hive remains 67% slower than Pig when comparing real time runtime (i.e. it takes Pig roughly 1/3 of the time to compute the JOIN); increasing the number of map tasks in Hive from 4 to 11 only resulted in a 13% speed-up. It should also be noted that the performance difference between Pig and Hive does not scale linearly: initially there is little difference in performance (due to the large start-up costs), but as the datasets increase in size, Hive becomes consistently slower (to the point of crashing when attempting to join large datasets).

To conclude, the discussed experiments allowed for the answering of 4 core questions:

1. How do Pig and Hive perform as other Hadoop properties are varied (e.g. the number of map tasks)? Balancing the ratio of mappers and reducers has a big impact on real time runtime and consequently is critical to optimizing performance: reducers should be started early enough that data transfer is spread out sufficiently to prevent network congestion. On the other hand, reducers should not be started so early that they take up slots that could still be used by map tasks. Care should also be taken when setting the maximum allowable map and reduce slots per node. For example, having a node with a maximum of 20 map slots but a script configured to use 30 map tasks will result in significant performance penalties, because the first 20 map tasks will run in parallel but the additional 10 will only be spawned once the first 20 have completed execution (consequently requiring one extra round of computation). The same goes for the number of reduce tasks: as illustrated by figure 9, performance peaks when a job requires a number of reduce slots per node that falls just below the maximum.

2. Do more complex datasets and queries (e.g. the TPC-H benchmarks) yield the same results as the Apache benchmarks from 11/07/07? At first glance, running the TPC-H benchmarks contradicts the Apache benchmark results: in nearly all instances, Hive outperforms Pig. However, closer examination revealed that nearly all TPC-H scripts rely heavily on the Group By operator, an operator which appears to be poorly implemented in Pig. Using the Group By operator greatly degrades the performance of Pig Latin scripts. The TPC-H benchmark results might therefore be less relevant to your decision process if grouping will not be a dominant feature of your application. (Operators are not evenly distributed throughout the scripts: if one operator is poorly implemented, then this will skew the entire result set, as can be seen in section 4.4.)

3. How does real time runtime scale with regard to CPU runtime? As expected given the cluster configuration (9 nodes), the real time runtime was between 15% and 20% of the cumulative CPU runtime.

4. What should the ratio of map and reduce tasks be? The ratio of map to reduce tasks can be configured through mapred.reduce.slowstart.completed.maps in Hadoop's conf/mapred-site.xml. The default value of 0.05 (i.e. 5%) was found to be too low; the optimum for our cluster was about 70%.

It should also be noted that the use of the Group By operator within the TPC-H benchmarks skews results significantly (recall the Apache benchmarks, which showed that Pig outperformed Hive in all instances except when using the Group By operator: when grouping data, Pig was 104% slower than Hive). Re-running the scripts and omitting the grouping of data produces the expected results. For example, running script 3 (q3_shipping_priority.pig) and omitting the Group By operator significantly reduces the runtime (to a total of 12,257,630ms cumulative CPU time).

As already noted in the introduction, Hadoop was designed to run on clusters containing hundreds or thousands of nodes; therefore, running small-scale performance analysis may not really do it justice. Ideally, the benchmarks presented in this article should be run on much larger clusters.

References

[1] Stewart, R. J.; Trinder, P.; Loidl, H. (2011), Comparing High Level MapReduce Query Languages. Springer Berlin Heidelberg, Advanced Parallel Processing Technologies.
[2] Moussa, R. (2012), TPC-H Benchmarking of Pig Latin on a Hadoop Cluster. Communications and Information Technology (ICCIT), 2012 International Conference.
[3] Loebman, S.; Nunley, D.; Kwon, Y.; Howe, B.; Balazinska, M.; Gardner, J. P., Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In Proc. of CLUSTER 2009.
[4] Schätzle, A.; Przyjaciel-Zablocki, M.; Hornung, T.; Lausen, G. (2011), PigSPARQL: Übersetzung von SPARQL nach Pig Latin. Proc. BTW.
[5] Lin, J.; Dyer, C. (2010), Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies.
[6] Pavlo, A.; Paulson, E.; Rasin, A.; Abadi, D. J.; DeWitt, D. J.; Madden, S.; Stonebraker, M. (2009), A comparison of approaches to large-scale data analysis. In Proc. SIGMOD, ACM.
[7] DBPedias (2013), Pig Performance Benchmarks. Visited 15/01/2013.
[8] Gates, A. F.; Natkovich, O.; Chopra, S.; Kamath, P.; Narayanamurthy, S. M.; Olston, C.; Reed, B.; Srinivasan, S.; Srivastava, U. (2009), Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience. Proc. VLDB Endow.
[9] Transaction Processing Council, Transaction Processing Council Website. Visited 18/06/2013.
[10] Apache Software Foundation (2009), Hive, PIG, Hadoop benchmark results. Visited 03/01/2013.
[11] Moussa, R. (2012), TPC-H Benchmarking of Pig Latin on a Hadoop Cluster. Communications and Information Technology (ICCIT), 2012 International Conference.
[12] Loebman, S.; Nunley, D.; Kwon, Y.; Howe, B.; Balazinska, M.; Gardner, J. P., Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In Proc. of CLUSTER 2009.
[13] Stewart, R. J.; Trinder, P.; Loidl, H. (2011), Comparing High Level MapReduce Query Languages. Springer Berlin Heidelberg, Advanced Parallel Processing Technologies.
[14] Transaction Processing Performance Council (TPC) (2013), TPC Benchmark H, Standard Specification. Transaction Processing Performance Council (TPC), Presidio of San Francisco.


More information

ITG Software Engineering

ITG Software Engineering Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

Hadoop Scripting with Jaql & Pig

Hadoop Scripting with Jaql & Pig Hadoop Scripting with Jaql & Pig Konstantin Haase und Johan Uhle 1 Outline Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 2 Introduction Goal: Compare two high level scripting languages

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System

How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

JackHare: a framework for SQL to NoSQL translation using MapReduce

JackHare: a framework for SQL to NoSQL translation using MapReduce DOI 10.1007/s10515-013-0135-x JackHare: a framework for SQL to NoSQL translation using MapReduce Wu-Chun Chung Hung-Pin Lin Shih-Chang Chen Mon-Fong Jiang Yeh-Ching Chung Received: 15 December 2012 / Accepted:

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop : Flexible Data Placement and Its Exploitation in Hadoop 1 Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan Rainer Gemulla +, Aljoscha Krettek #, John McPherson IBM Almaden Research Center, USA {myeltaba,

More information

Hadoop and MySQL for Big Data

Hadoop and MySQL for Big Data Hadoop and MySQL for Big Data Alexander Rubin October 9, 2013 About Me Alexander Rubin, Principal Consultant, Percona Working with MySQL for over 10 years Started at MySQL AB, Sun Microsystems, Oracle

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

SOLUTION BRIEF: SLCM R12.7 PERFORMANCE TEST RESULTS JANUARY, 2012. Load Test Results for Submit and Approval Phases of Request Life Cycle

SOLUTION BRIEF: SLCM R12.7 PERFORMANCE TEST RESULTS JANUARY, 2012. Load Test Results for Submit and Approval Phases of Request Life Cycle SOLUTION BRIEF: SLCM R12.7 PERFORMANCE TEST RESULTS JANUARY, 2012 Load Test Results for Submit and Approval Phases of Request Life Cycle Table of Contents Executive Summary 3 Test Environment 4 Server

More information

Performance Analysis of Cloud Relational Database Services

Performance Analysis of Cloud Relational Database Services Performance Analysis of Cloud Relational Database Services Jialin Li lijl@cs.washington.edu Naveen Kr. Sharma naveenks@cs.washington.edu June 7, 203 Adriana Szekeres aaazs@cs.washington.edu Introduction

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information