HiBench Installation
Sunil Raiyani, Jayam Modi
Last Updated: May 23, 2014
Contents

1 Introduction
2 Installation
3 HiBench Benchmarks [3]
  3.1 Micro Benchmarks
    3.1.1 Sort (sort)
    3.1.2 WordCount (wordcount)
    3.1.3 TeraSort (terasort)
  3.2 HDFS Benchmarks
    3.2.1 Enhanced DFSIO (dfsioe)
  3.3 Web Search Benchmarks
    3.3.1 Nutch indexing (nutchindexing)
    3.3.2 PageRank (pagerank)
  3.4 Machine Learning Benchmarks
    3.4.1 Mahout Bayesian classification (bayes)
    3.4.2 Mahout K-means clustering (kmeans)
  3.5 Data Analytics Benchmarks
    3.5.1 Hive Query Benchmarks (hivebench)
1 Introduction

This report briefly describes the procedure used to install the HiBench benchmarking tool and summarizes the functionality of the tool.

2 Installation

HiBench is a plug-and-play tool and can be installed on the system in the following simple steps:

1. Install Hadoop on the system using the steps described at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [1].
2. Download the HiBench 2.2 suite from https://github.com/intel-hadoop/hibench/zipball/hibench-2.2 [2].
3. Extract the downloaded archive and rename the resulting folder to HiBench.
4. In the bin subdirectory, edit the hibench-config.sh file and set $HADOOP_HOME to /usr/local/hadoop.
5. Run the run-all.sh script in the same subdirectory from the terminal.

The report of the tests will be found in the hibench-report file in the HiBench directory.

3 HiBench Benchmarks [3]

The HiBench tool runs nine different types of tests, grouped into the following categories.

3.1 Micro Benchmarks

3.1.1 Sort (sort)

This workload sorts its text input data, which is generated using the Hadoop RandomTextWriter example.

3.1.2 WordCount (wordcount)

This workload counts the occurrences of each word in its input data, which is also generated using the Hadoop RandomTextWriter example. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.

3.1.3 TeraSort (terasort)

TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.
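All three micro benchmarks are built on the stock Hadoop example jobs, so their behavior can also be reproduced directly from the Hadoop installation. The following is a minimal sketch, assuming a Hadoop 1.x tarball installed under /usr/local/hadoop; the examples jar name, and the (potentially large) default output size of RandomTextWriter, vary with the Hadoop version:

    # Generate random text input, then sort it and count its words --
    # the same input/job pairing the sort and wordcount workloads use.
    cd /usr/local/hadoop
    bin/hadoop jar hadoop-examples-*.jar randomtextwriter rand-text
    bin/hadoop jar hadoop-examples-*.jar sort rand-text sort-out
    bin/hadoop jar hadoop-examples-*.jar wordcount rand-text wc-out

    # TeraSort: generate 1,000,000 100-byte rows with TeraGen, then sort them.
    bin/hadoop jar hadoop-examples-*.jar teragen 1000000 tera-in
    bin/hadoop jar hadoop-examples-*.jar terasort tera-in tera-out

HiBench wraps jobs like these with fixed data sizes, times their execution, and collects the results into the hibench-report file mentioned in Section 2.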
3.2 HDFS Benchmarks

3.2.1 Enhanced DFSIO (dfsioe)

Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster.

3.3 Web Search Benchmarks

3.3.1 Nutch indexing (nutchindexing)

Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system of Nutch, a popular open-source (Apache project) search engine. The workload uses automatically generated Web data whose hyperlinks and words both follow the Zipfian distribution with corresponding parameters. The dictionary used to generate the Web page text is the default Linux dictionary file /usr/share/dict/linux.words.

3.3.2 PageRank (pagerank)

This workload contains an implementation of the PageRank algorithm on Hadoop (a search-engine ranking benchmark included in Pegasus 2.0). The workload uses automatically generated Web data whose hyperlinks follow the Zipfian distribution.

3.4 Machine Learning Benchmarks

3.4.1 Mahout Bayesian classification (bayes)

Large-scale machine learning is another important use of MapReduce. This workload tests the Naive Bayes trainer (a popular classification algorithm for knowledge discovery and data mining) in Mahout 0.7, an open-source (Apache project) machine learning library. The workload uses automatically generated documents whose words follow the Zipfian distribution. The dictionary used for text generation is also the default Linux file /usr/share/dict/linux.words.

3.4.2 Mahout K-means clustering (kmeans)

This workload tests K-means (a well-known clustering algorithm for knowledge discovery and data mining) clustering in Mahout 0.7. The input data set is generated by GenKMeansDataset based on uniform and Gaussian distributions.

3.5 Data Analytics Benchmarks

3.5.1 Hive Query Benchmarks (hivebench)

This workload is based on the SIGMOD '09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. It contains Hive queries (Aggregation and Join) performing the typical OLAP queries described in the paper. Its input is also automatically generated Web data with hyperlinks following the Zipfian distribution.
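To make the two query shapes concrete, here is a rough sketch of the Aggregation and Join queries as they appear in the paper, issued through the Hive CLI. The table and column names (uservisits, rankings, sourceIP, adRevenue, pageRank) and the date bounds are taken from the paper and are assumptions here; the schemas hivebench actually creates may differ:

    # Aggregation: total ad revenue per source IP.
    hive -e "SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;"

    # Join: for visits in a date range, join page rankings to user visits
    # and report the source IP with the highest total revenue.
    hive -e "
      SELECT UV.sourceIP, SUM(UV.adRevenue) AS totalRevenue, AVG(R.pageRank)
      FROM rankings R JOIN uservisits UV ON (R.pageURL = UV.destURL)
      WHERE UV.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
      GROUP BY UV.sourceIP
      ORDER BY totalRevenue DESC
      LIMIT 1;"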
References

[1] Hadoop Installation. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ (accessed May 23, 2014).

[2] HiBench 2.2. https://github.com/intel-hadoop/hibench/zipball/hibench-2.2

[3] intel-hadoop/hibench. https://github.com/intel-hadoop/hibench (accessed May 23, 2014).