BigDataBench Khushbu Agarwal Last Updated: May 23, 2014
Contents
1 What is BigDataBench? [1]
  1.1 SUMMARY
  1.2 METHODOLOGY
2 BigDataBench: a Big Data Benchmark Suite from Internet Services [2]
  2.1 SUMMARY
    2.1.1 OBSERVATION
  2.2 PROBLEMS IDENTIFIED
3 BigDataBench: a Big Data Benchmark Suite from Web Search Engines [3]
  3.1 SUMMARY
  3.2 OBSERVATION
  3.3 POSITIVE POINTS
  3.4 PROBLEMS IDENTIFIED
1 What is BigDataBench? [1]

1.1 SUMMARY
BigDataBench is a big data benchmark suite; the current version is BigDataBench 3.0. It consists of 6 real-world and 2 synthetic data sets and 32 big data workloads. It covers micro and application benchmarks from the areas of search engines, social networks, and e-commerce. To create a variety of workloads, BigDataBench focuses on units of computation that occur frequently in OLTP, OLAP, interactive analytics, and offline analytics. It provides several big data generation tools (BDGS) to generate scalable big data. It is open source under the Apache License, Version 2.0.

1.2 METHODOLOGY
The methodology consists of six steps:
1. Investigating typical application domains.
2. Understanding and choosing workloads and data sets.
3. Generating scalable data sets and workloads.
4. Providing different implementations.
5. Providing system characterization.
6. Finalizing the benchmarks.
2 BigDataBench: a Big Data Benchmark Suite from Internet Services [2]

2.1 SUMMARY
Data are generated faster than ever, and the rate of data generation is expected to keep increasing exponentially in the coming years. These facts gave rise to the concept of big data. The diversity of data and workloads calls for comprehensive and continuous effort on big data benchmarking. Given the broad use of big data systems, and for the sake of fairness, big data benchmarks must include a diversity of workloads and data sets, which is the prerequisite for evaluating big data systems and architectures. BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets.

2.1.1 OBSERVATION
In the methodology of BigDataBench, after investigating the application domains of Internet services, the focus is placed on workloads from search engines, e-commerce, and social networks. In addition, there are micro benchmarks for different data sources, OLTP workloads, and relational query workloads, since they are fundamental and widely used.
For these three application domains, six representative real-world data sets are collected. Their variety is reflected in two dimensions, data types and data sources, and covers the whole spectrum of data types: structured, semi-structured, and unstructured data.
To date, nineteen big data benchmarks have been developed, chosen along the dimensions of application scenarios, operations/algorithms, data types, data sources, software stacks, and application types.
Compared with traditional benchmarks, including HPCC, PARSEC, and SPEC CPU, the floating-point operation intensity of BigDataBench is two orders of magnitude lower.
The volume of input data has a non-negligible impact on micro-architectural events.
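The operation-intensity comparison can be made concrete with a small calculation. The sketch below assumes that operation intensity is the number of operations executed divided by the bytes of memory traffic. The function name and all counter values are hypothetical, chosen only so that the two results differ by two orders of magnitude; they are not measurements from [2].

```python
import math


def operation_intensity(operations: int, memory_bytes: int) -> float:
    """Operations executed per byte of memory traffic (assumed definition)."""
    if memory_bytes <= 0:
        raise ValueError("memory_bytes must be positive")
    return operations / memory_bytes


# Hypothetical counter values, not measurements from the paper.
traditional = operation_intensity(operations=2_000_000_000, memory_bytes=1_000_000_000)
big_data = operation_intensity(operations=20_000_000, memory_bytes=1_000_000_000)

# Express the gap in orders of magnitude (base-10 logarithm of the ratio).
orders_of_magnitude = math.log10(traditional / big_data)
print(f"{traditional:.2f} vs {big_data:.2f} ops/byte, "
      f"gap = {orders_of_magnitude:.1f} orders of magnitude")
```

The same function applies to floating-point or integer operation counts, as long as both workloads are measured over the same kind of memory traffic.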
Big Data Benchmarking Requirements:
1. A big data benchmark suite candidate must cover not only broad application scenarios but also diverse and representative real-world data sets.
2. Big data systems must handle the four dimensions of big data, known as the 4Vs (volume, variety, velocity, and veracity).
3. The workloads must be diverse and representative.
4. The suite must cover representative software stacks.
5. A big data benchmark suite should keep pace with improvements in the underlying systems.
6. The benchmarks should be easy to deploy, configure, and run, and the performance data should be easy to obtain.
The BigDataBench workloads are chosen with the following considerations:
1. Paying equal attention to different application types: online services, real-time analytics, and offline analytics.
2. Covering workloads in diverse and representative application scenarios.
3. Including different data sources.
4. Covering the representative software stacks.

The Big Data Generator is a comprehensive tool for generating synthetic data. Its data generators are organized to cover a wide class of application domains.

Two categories of metrics are used for evaluation:
1. User-perceivable metrics: requests per second (RPS), operations per second (OPS), and data processed per second (DPS).
2. Architectural metrics: million instructions per second (MIPS) and cache misses per thousand instructions (MPKI).

Different big data workloads show different performance trends as the data scale increases.
Architectural metrics are closely related to input data volumes and vary across workloads.
The L3 caches of the processor are effective for the big data workloads.
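The two metric categories reduce to simple ratios over counters collected during a run. The following is a minimal sketch of those ratios. The function names and every input value are hypothetical, and the sketch does not show how BigDataBench itself collects the counters.

```python
def per_second(count: float, elapsed_seconds: float) -> float:
    """Generic rate behind RPS, OPS, and DPS: units of work per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return count / elapsed_seconds


def mips(instructions: int, elapsed_seconds: float) -> float:
    """Million instructions retired per second."""
    return per_second(instructions, elapsed_seconds) / 1e6


def mpki(cache_misses: int, instructions: int) -> float:
    """Cache misses per thousand instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cache_misses * 1000 / instructions


# Hypothetical counters for one 60-second run (not real measurements).
elapsed = 60.0
rps = per_second(12_000, elapsed)      # requests served during the run
dps_mb = per_second(6_000, elapsed)    # megabytes of input processed
mips_value = mips(120_000_000_000, elapsed)
mpki_value = mpki(600_000_000, 120_000_000_000)

print(f"RPS={rps}, DPS={dps_mb} MB/s, MIPS={mips_value}, MPKI={mpki_value}")
```

RPS, OPS, and DPS share one helper because they differ only in the unit of work being counted.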
2.2 PROBLEMS IDENTIFIED
1. The complexity, diversity, and frequent change of workloads, together with the rapid evolution of big data systems, pose great challenges to big data benchmarking.
2. Most big data benchmarking efforts target specific types of applications or system software stacks, and hence fail to cover the diversity of workloads and real-world data sets.
3. Although BigBench offers a variety of data types, its objects under test are DBMS and MapReduce systems that claim to provide big data solutions, which leads to only partial coverage of software stacks. Furthermore, it is currently not open source, which hinders usage and adoption.
4. The operation intensity of the big data workloads is low.
3 BigDataBench: a Big Data Benchmark Suite from Web Search Engines [3]

3.1 SUMMARY
Big data are considered an asset of companies, organizations, and even countries. Extracting value from big data requires capable big data systems. After investigating different application domains of Internet services, an important class of big data applications, the authors focus on search engines, which are the most important domain of Internet services in terms of the number of page views and daily visitors. The paper presents a detailed analysis of search engine workloads and a benchmarking methodology. It also proposes an innovative data generation methodology and tool that generate scalable volumes of big data from a small seed of real data.
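The idea of scaling a small seed of real data into a larger synthetic data set can be illustrated with a deliberately simple stand-in. The sketch below samples words from the seed's empirical word-frequency distribution (a unigram model). This is not the generator proposed in [3], whose method is not described in this summary. All names and the seed text are hypothetical, and a unigram model preserves only approximate word frequencies, not semantics or locality.

```python
import random
from collections import Counter


def build_unigram_model(seed_text: str):
    """Return the seed vocabulary and the matching word counts."""
    counts = Counter(seed_text.split())
    words = list(counts)
    weights = [counts[word] for word in words]
    return words, weights


def generate_text(seed_text: str, num_words: int, rng_seed: int = 0) -> str:
    """Generate num_words words drawn in proportion to seed frequencies."""
    words, weights = build_unigram_model(seed_text)
    rng = random.Random(rng_seed)  # fixed seed keeps the output reproducible
    return " ".join(rng.choices(words, weights=weights, k=num_words))


# Hypothetical seed; a real seed would be a sample of the real data set.
seed = "search engine data search query data data"
synthetic = generate_text(seed, num_words=1000)
print(len(synthetic.split()), "words generated")
```

The output size is controlled by num_words alone, which is what makes the generated volume scalable independently of the seed size.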
3.2 OBSERVATION
1. The peak data processing rates of big data systems depend on both the application and the data volume.
2. The development of a semantic search engine, ProfSearch, paved the way for BigDataBench, a big data benchmark suite from search engines.
3. Synthetic data that preserve the semantic and locality characteristics of real data are generated for benchmarking.
4. The following workloads are chosen for BigDataBench: Sort, Grep, WordCount, Naive Bayes, and SVM.
5. The key characteristics of a search workload trace are its query sequence and time sequence.
6. Some architectural events, such as cache and TLB behaviours, tend towards stability only once the data volume has increased to a certain extent.
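Sort, Grep, and WordCount are built around simple units of computation. The single-process sketch below shows what each one computes. It is not the BigDataBench implementation, and the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter


def word_count(lines):
    """WordCount: count occurrences of each whitespace-separated word."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def grep(lines, pattern):
    """Grep: keep only the lines that match a regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def sort_records(lines):
    """Sort: order the records lexicographically."""
    return sorted(lines)


# Hypothetical input records.
sample = ["big data benchmark", "search engine data", "benchmark suite"]

counts = word_count(sample)
matches = grep(sample, r"engine")
ordered = sort_records(sample)
print(counts["data"], matches, ordered)
```

In a benchmark setting the same operations run over far larger inputs, which is where the dependence of processing rate on data volume becomes visible.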
3.3 POSITIVE POINTS
1. The data processing rates of the workloads are close for synthetic and real data: for the same workload, the deviation between the two data sets is less than 12.9%.
2. The cache and TLB behaviours for real and synthetic data are also close: for the same workload, the deviation between the two data sets is very small.
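The closeness of real and synthetic data can be checked with a relative deviation. The sketch below assumes that deviation means the absolute difference relative to the real-data value. This summary does not state the exact formula used in [3], and the two rates are hypothetical.

```python
def relative_deviation(real_value: float, synthetic_value: float) -> float:
    """Absolute difference relative to the real-data value (assumed formula)."""
    if real_value == 0:
        raise ValueError("real_value must be non-zero")
    return abs(synthetic_value - real_value) / abs(real_value)


# Hypothetical data processing rates in MB/s for one workload.
real_rate = 100.0
synthetic_rate = 110.0

deviation = relative_deviation(real_rate, synthetic_rate)
within_bound = deviation < 0.129  # the 12.9% bound reported for processing rates
print(f"deviation = {deviation:.1%}, within bound: {within_bound}")
```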
3.4 PROBLEMS IDENTIFIED
Search engine service providers treat their data, applications, and web access logs as confidential business information, which makes it difficult to build benchmarks from them.
References
[1] BigDataBench. Available at http://prof.ict.ac.cn/bigdatabench. Downloaded in May 2014.
[2] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, et al., BigDataBench: a big data benchmark suite from Internet services, arXiv preprint arXiv:1401.1406, 2014.
[3] W. Gao, Y. Zhu, Z. Jia, C. Luo, L. Wang, Z. Li, J. Zhan, Y. Qi, Y. He, S. Gong, et al., BigDataBench: a big data benchmark suite from web search engines, arXiv preprint arXiv:1307.0320, 2013.