BigDataBench. Khushbu Agarwal




BigDataBench Khushbu Agarwal Last Updated: May 23, 2014

Contents
1 What is BigDataBench? [1]
  1.1 SUMMARY
  1.2 METHODOLOGY
2 BigDataBench: a Big Data Benchmark Suite from Internet Services [2]
  2.1 SUMMARY
    2.1.1 OBSERVATION
  2.2 PROBLEMS IDENTIFIED
3 BigDataBench: a Big Data Benchmark Suite from Web Search Engines [3]
  3.1 SUMMARY
  3.2 OBSERVATION
  3.3 POSITIVE POINTS
  3.4 PROBLEMS IDENTIFIED

1 What is BigDataBench? [1]

1.1 SUMMARY

BigDataBench is a big data benchmark suite; the current version is BigDataBench 3.0. It consists of 6 real-world and 2 synthetic data sets and 32 big data workloads. It covers micro and application benchmarks from the areas of search engines, social networks, and e-commerce. To create a variety of workloads, BigDataBench focuses on units of computation that occur frequently in OLTP, OLAP, interactive analytics, and offline analytics. It provides several big data generation tools (BDGS) to generate scalable big data. It is open source under the Apache License, Version 2.0.

1.2 METHODOLOGY

The methodology consists of six steps:
1. Investigating typical application domains.
2. Understanding and choosing workloads and data sets.
3. Generating scalable data sets and workloads.
4. Providing different implementations.
5. Providing system characterization.
6. Finalizing the benchmarks.

2 BigDataBench: a Big Data Benchmark Suite from Internet Services [2]

2.1 SUMMARY

Data are being generated faster than ever, and the rate of data generation is expected to keep increasing exponentially in the coming years. These facts gave rise to the concept of big data. The diversity of data and workloads calls for comprehensive and continuous effort on big data benchmarking. Given the broad use of big data systems, and for the sake of fairness, big data benchmarks must include a diversity of workloads and data sets; this is the prerequisite for evaluating big data systems and architectures. BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets.

2.1.1 OBSERVATION

In the BigDataBench methodology, after investigating the application domains of Internet services, the focus is placed on workloads from search engines, e-commerce, and social networks. In addition, there are micro benchmarks for different data sources, OLTP workloads, and relational query workloads, since these are fundamental and widely used.

For these three application domains, six representative real-world data sets are collected. Their variety is reflected in two dimensions, data types and data sources, and they span the whole spectrum of data types: structured, semi-structured, and unstructured data.

To date, nineteen big data benchmarks have been developed, covering the dimensions of application scenarios, operations/algorithms, data types, data sources, software stacks, and application types.

Compared with traditional benchmarks, including HPCC, PARSEC, and SPEC CPU, the floating-point operation intensity of BigDataBench is two orders of magnitude lower. The volume of input data has a non-negligible impact on micro-architectural events.

Big Data Benchmarking Requirements:
- A big data benchmark suite candidate must cover not only broad application scenarios but also diverse and representative real-world data sets.
- Big data systems must handle the four dimensions of big data, known as the 4Vs (volume, velocity, variety, and veracity).
- The workloads must be diverse and representative.
- The suite must cover representative software stacks.
- A big data benchmark suite should keep pace with improvements in the underlying systems.
- The benchmarks should be easy to deploy, configure, and run, and the performance data should be easy to obtain.

The BigDataBench workloads are chosen with the following considerations:
- Paying equal attention to different application types: online services, real-time analytics, and offline analytics.
- Covering workloads in diverse and representative application scenarios.
- Including different data sources.
- Covering representative software stacks.

The Big Data Generator is a comprehensive tool for generating synthetic data. Its data generators are organized to serve a wide range of application domains.

Two categories of metrics are used for evaluation:
- User-perceivable metrics: requests per second (RPS), operations per second (OPS), and data processed per second (DPS).
- Architectural metrics: million instructions per second (MIPS) and misses per kilo-instructions (MPKI).

Different big data workloads show different performance trends as the data scale increases. Architectural metrics are closely related to input data volume and vary across workloads. The L3 caches of the processor are effective for the big data workloads.
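As a rough illustration of the two metric categories, the sketch below computes them from raw measurements. The function names and the example figures are hypothetical and are not taken from the BigDataBench tooling; in practice the counts would come from the workload driver and from hardware performance counters.

```python
# Minimal sketch of the evaluation metrics. All names and inputs here are
# illustrative assumptions, not part of the BigDataBench code base.

def rate_per_second(count: float, elapsed_seconds: float) -> float:
    """Generic rate: used for RPS (requests) and OPS (operations)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return count / elapsed_seconds


def dps(bytes_processed: float, elapsed_seconds: float) -> float:
    """Data processed per second, in bytes per second."""
    return rate_per_second(bytes_processed, elapsed_seconds)


def mips(instructions_retired: float, elapsed_seconds: float) -> float:
    """Million instructions per second."""
    return rate_per_second(instructions_retired, elapsed_seconds) / 1e6


def mpki(misses: float, instructions_retired: float) -> float:
    """Misses per kilo-instructions (for example, cache or TLB misses)."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return misses * 1000 / instructions_retired


if __name__ == "__main__":
    # Hypothetical measurements for one run of a workload.
    print("DPS :", dps(1_000_000, 4))
    print("MIPS:", mips(2_000_000_000, 2.0))
    print("MPKI:", mpki(5000, 1_000_000))
```

Keeping the user-perceivable and architectural metrics as separate small functions mirrors the two categories above and makes it easy to report both for the same run.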

2.2 PROBLEMS IDENTIFIED

- The complexity, diversity, and frequently changing workloads of big data systems, together with their rapid evolution, pose great challenges to big data benchmarking.
- Most big data benchmarking efforts target specific types of applications or system software stacks, and hence fail to cover the diversity of workloads and real-world data sets.
- Although BigBench has a variety of data types, its objects under test are DBMS and MapReduce systems that claim to provide big data solutions, which leads to partial coverage of software stacks. Furthermore, it is currently not open source, which hinders easy usage and adoption.
- The operation intensity of the big data workloads is low.
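To make the notion of operation intensity concrete, here is a small sketch that assumes it is measured as operations executed per byte of memory traffic. That definition, the function names, and the sample numbers are assumptions for illustration only; they are not measurements from the paper.

```python
import math

# Sketch only: assumes operation intensity = operations / bytes of memory access.

def operation_intensity(operations: float, memory_bytes: float) -> float:
    """Operations executed per byte of memory traffic."""
    if memory_bytes <= 0:
        raise ValueError("memory_bytes must be positive")
    return operations / memory_bytes


def orders_of_magnitude_lower(reference: float, observed: float) -> float:
    """How many powers of ten `observed` lies below `reference`."""
    if reference <= 0 or observed <= 0:
        raise ValueError("intensities must be positive")
    return math.log10(reference / observed)


if __name__ == "__main__":
    # Hypothetical counter readings for two workloads.
    traditional = operation_intensity(1_000_000, 1_000_000)  # 1 op per byte
    big_data = operation_intensity(10_000, 1_000_000)        # 0.01 op per byte
    print(orders_of_magnitude_lower(traditional, big_data))
```

A gap of about 2 in this measure corresponds to the "two orders of magnitude lower" comparison reported for floating-point operation intensity.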

3 BigDataBench: a Big Data Benchmark Suite from Web Search Engines [3]

3.1 SUMMARY

Big data are considered an asset of companies, organizations, and even countries. Extracting value from big data requires enabling big data systems. After investigating different application domains of Internet services, an important class of big data applications, the paper focuses on search engines, which are the most important domain of Internet services in terms of page views and daily visitors. The paper presents a detailed analysis of search engine workloads and a benchmarking methodology. It also proposes an innovative data generation methodology and tool that generate scalable volumes of big data from a small seed of real data.

3.2 OBSERVATION

- The peak data processing rates of big data systems depend on both the application and the data volume.
- The development of a semantic search engine, ProfSearch, paved the way for BigDataBench, a big data benchmark suite from search engines.
- Synthetic data generated for benchmarking preserve the semantic and locality characteristics of the real data.
- The following workloads are chosen for BigDataBench: Sort, Grep, WordCount, Naive Bayes, and SVM.
- The key characteristics of a search workload trace are the query sequence and the timing sequence.
- Some architectural events, such as cache and TLB behaviours, tend towards stability only once the data volume increases to a certain extent.
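For readers unfamiliar with the three micro workloads named above, the following single-machine sketch shows what Sort, Grep, and WordCount compute. It is an illustrative stand-in with hypothetical function names; the actual benchmark implementations run on distributed software stacks and are not reproduced here.

```python
import re
from collections import Counter
from typing import Iterable

# Toy, in-memory versions of three micro workloads (illustration only).

def sort_records(lines: Iterable[str]) -> list[str]:
    """Sort: order the input records lexicographically."""
    return sorted(lines)


def grep(lines: Iterable[str], pattern: str) -> list[str]:
    """Grep: keep only the records that match a regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def word_count(lines: Iterable[str]) -> Counter:
    """WordCount: count occurrences of each whitespace-separated word."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    sample = ["big data bench", "data sets", "big systems"]
    print(sort_records(sample))
    print(grep(sample, r"^big"))
    print(word_count(sample).most_common(2))
```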

3.3 POSITIVE POINTS

- For synthetic and real data, the data processing rates of the workloads are close: the deviation between the two data sets under the same workload is less than 12.9%.
- The cache and TLB behaviours for real and synthetic data are close, and the deviation between the two data sets under the same workload is very small.
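The comparison between real and synthetic data can be sketched as a relative deviation check. The summary does not state the exact formula used in the paper, so the definition below (absolute difference relative to the real-data value) is an assumption, and the function names and sample rates are hypothetical.

```python
# Sketch under an assumed definition of deviation:
#   deviation (%) = 100 * |synthetic - real| / real

def relative_deviation_pct(real_value: float, synthetic_value: float) -> float:
    """Percentage deviation of the synthetic-data result from the real-data one."""
    if real_value == 0:
        raise ValueError("real_value must be non-zero")
    return 100 * abs(synthetic_value - real_value) / abs(real_value)


def within_bound(real_value: float, synthetic_value: float,
                 bound_pct: float = 12.9) -> bool:
    """True if the deviation stays below the bound (12.9% is the reported figure)."""
    return relative_deviation_pct(real_value, synthetic_value) < bound_pct


if __name__ == "__main__":
    # Hypothetical data processing rates (MB/s) for one workload.
    real_rate, synthetic_rate = 100.0, 110.0
    print(relative_deviation_pct(real_rate, synthetic_rate))
    print(within_bound(real_rate, synthetic_rate))
```

The same check could be applied per workload to architectural metrics such as cache or TLB miss rates, with a bound chosen by the evaluator.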

3.4 PROBLEMS IDENTIFIED

Search engine service providers treat data, applications, and web access logs as confidential business information, which prevents us from building benchmarks.

References

[1] BigDataBench. Available at http://prof.ict.ac.cn/bigdatabench. Downloaded in May 2014.
[2] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, et al., "BigDataBench: A big data benchmark suite from Internet services," arXiv preprint arXiv:1401.1406, 2014.
[3] W. Gao, Y. Zhu, Z. Jia, C. Luo, L. Wang, Z. Li, J. Zhan, Y. Qi, Y. He, S. Gong, et al., "BigDataBench: A big data benchmark suite from web search engines," arXiv preprint arXiv:1307.0320, 2013.