How To Write A Bigbench Benchmark For A Retailer



Similar documents
IEEE BigData 2014 Tutorial on Big Data Benchmarking

BigBench: Towards an Industry Standard Benchmark for Big DataAnalytics

The BigData Top100 List Initiative. Chaitan Baru San Diego Supercomputer Center

BigBench: Towards an Industry Standard Benchmark for Big Data Analytics

4th Workshop on Big Data Benchmarking

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Testing 3Vs (Volume, Variety and Velocity) of Big Data

How To Scale Out Of A Nosql Database

Native Connectivity to Big Data Sources in MSTR 10

Tap into Hadoop and Other No SQL Sources

Unified Big Data Processing with Apache Spark. Matei

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Scaling Out With Apache Spark. DTL Meeting Slides based on

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Moving From Hadoop to Spark

The BigData Top100 List Initiative. Speakers: Chaitan Baru, San Diego Supercomputer Center, UC San Diego Milind Bhandarkar, Greenplum/EMC

The Internet of Things and Big Data: Intro

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Constructing a Data Lake: Hadoop and Oracle Database United!

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Big Data. Lyle Ungar, University of Pennsylvania

Ali Ghodsi Head of PM and Engineering Databricks

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Which SQL Engine Leads the Herd?

So What s the Big Deal?

Hadoop Job Oriented Training Agenda

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Big Data and Industrial Internet

Big Data Technologies Compared June 2014

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Hadoop & Spark Using Amazon EMR

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Unified Big Data Analytics Pipeline. 连 城

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Teradata Unified Big Data Architecture

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Setting the Direction for Big Data Benchmark Standards

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

A Study of Data Management Technology for Handling Big Data

CS 378 Big Data Programming. Lecture 2 Map- Reduce

TECHNOLOGY TRANSFER PRESENTS MIKE FERGUSON BIG DATA MULTI-PLATFORM JUNE 25-27, 2014 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Big Data on Microsoft Platform

NextGen Infrastructure for Big DATA Analytics.

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Big Data and Data Science: Behind the Buzz Words

Hadoop IST 734 SS CHUNG

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Internals of Hadoop Application Framework and Distributed File System

CS 378 Big Data Programming

Putting Apache Kafka to Use!

The Inside Scoop on Hadoop

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Big Systems, Big Data

KNIME & Avira, or how I ve learned to love Big Data

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

Ramesh Bhashyam Teradata Fellow Teradata Corporation

Real Time Big Data Processing

Big Data Too Big To Ignore

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data

Information Architecture

A Brief Outline on Bigdata Hadoop

Big Data With Hadoop

HiBench Introduction. Carson Wang Software & Services Group

Next-Gen Big Data Analytics using the Spark stack

CSE-E5430 Scalable Cloud Computing Lecture 2

A Performance Analysis of Distributed Indexing using Terrier

Big Data Generation. Tilmann Rabl and Hans-Arno Jacobsen

Luncheon Webinar Series May 13, 2013

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Big Data Course Highlights

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Oracle Big Data SQL Technical Update

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop-BAM and SeqPig

MapReduce with Apache Hadoop Analysing Big Data

Data processing goes big

Big Data and Apache Hadoop s MapReduce

Hadoop implementation of MapReduce computational model. Ján Vaňo

Big Data Explained. An introduction to Big Data Science.

Brave New World: Hadoop vs. Spark

How To Create A Data Visualization With Apache Spark And Zeppelin

BIG DATA What it is and how to use?

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Large scale processing using Hadoop. Ján Vaňo

Challenges for Data Driven Systems

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Setting the Direction for Big Data Benchmark Standards Chaitan Baru, PhD San Diego Supercomputer Center UC San Diego

Chapter 7. Using Hadoop Cluster and MapReduce

Transcription:

BigBench Overview Towards a Comprehensive End-to-End Benchmark for Big Data - bankmark UG (haftungsbeschränkt) 02/04/2015 @ SPEC RG Big Data

The BigBench Proposal End to end benchmark Application level Based on a product retailer (TPC-DS) Focused on Parallel DBMS and MR engines History Launched at 1 st WBDB, San Jose Published at SIGMOD 2013 Spec at WBDB proceedings 2012 (queries & data set) Full kit at WBDB 2014 Collaboration with Industry & Academia First: Teradata, University of Toronto, Oracle, InfoSizing Now: bankmark, CLDS, Cisco, Cloudera, Hortonworks, Infosizing, Intel, Microsoft, Oracle, Pivotal, SAP, IBM, UoFT, 2

Before BigBench Micro-Benchmarks System level measurement Illustrative not informative Functional Benchmarks Better than micro-benchmarks Simplified approach limited representation in e2e Benchmark suites Collection of micro and functional Standardization problems 3

Specs and Standards TPC xhs Very first Industry standard Detailed metrics and run rules Model framework TPC-DS BigData Derive as-is from TPC-DS Query based for SQL on Hadoop BigBench Based on new specification with some reuse Complex batch analytics Long term bridge 4

BigBench - 2013 Collaborative industry effort Sigmod 2013 Address 3V s of big data Very first concept for a big data benchmark specification Wide industry support Use case sampling Retail use case example End to end and component Framework Agnostic Well defined specification SW based reference implementation 5

Derived from TPC-DS Multiple snowflake schemas with shared dimensions 24 tables with an average of 18 columns 99 distinct SQL 99 queries with random substitutions Representative skewed database content Sub-linear scaling of non-fact tables Ad-hoc, reporting, iterative and extraction queries ETL-like data maintenance Catalog Returns Catalog Sales Time Dim Item Warehouse Ship Mode Inventory Date Dim Web Site Web Sales Web Returns Web Sales Promotion Income Band Store Returns Store Sales Customer Demographics Customer Address Customer Household Demographics 6

BigBench Data Model Marketprice Web Page Structured Data Sales Web Log Item Customer Unstructured Data Reviews Adapted TPC-DS Structured: TPC-DS + market prices Semi-structured: website clickstream Unstructured: customers reviews Semi-Structured Data BigBench Specific 7

Data Model 3 Vs Variety Different schema parts Volume Based on scale factor Similar to TPC-DS scaling, but continuous Weblogs & product reviews also scaled Velocity Refresh for all data with different velocities 8

Scaling Continuous scaling model Realistic SF 1 ~ 1 GB Different scaling speeds Adapted from TPC-DS Static Square root Logarithmic Linear (LF) 9

Generating Big Data Repeatable computation Based on XORSHIFT random number generators Hierarchical seeding strategy Enables independent generation of every value in the data set Enables independent re-generation of every value for references User specifies Schema data model Format CSV, SQL statements, Distribution multi-core, multi-node, partially PDGF generates High quality data distributed, in parallel, in the correct format Large data terabytes, petabytes 10

Workload Workload Queries 30 queries Specified in English (sort of) No required syntax (first implementation in Aster SQL MR) Kit implemented in Hive, HadoopMR, Mahout, OpenNLP Business functions (adapted from McKinsey report) Marketing Cross-selling, customer micro-segmentation, sentiment analysis, enhancing multichannel consumer experiences Merchandising Assortment optimization, pricing optimization Operations Performance transparency, product return analysis Supply chain Inventory management Reporting (customers and products) 11

Workload - Technical Aspects Generic Characteristics Data Sources #Queries Percentage Structured 18 60% Semi-structured 7 23% Un-structured 5 17% Hive Implementation Characteristics Query Types #Queries Percentage Pure HiveQL 14 46% Mahout 5 17% OpenNLP 5 17% Custom MR 6 20% Query Input Datatype Processing Model Query Input Datatype Processing Model #1 Structured Java MR #16 Structured Java MR (OpenNLP) #2 Semi-Structured Java MR #17 Structured HiveQL #3 Semi-Structured Python Streaming MR #18 Unstructured Java MR (OpenNLP) #4 Semi-Structured Python Streaming MR #19 Structured Java MR (OpenNLP) #5 Semi-Structured HiveQL #20 Structured Java MR (Mahout) #6 Structured HiveQL #21 Structured HiveQL #7 Structured HiveQL #22 Structured HiveQL #8 Semi-Structured HiveQL #23 Structured HiveQL #9 Structured HiveQL #24 Structured HiveQL #10 Unstructured Java MR (OpenNLP) #25 Structured Java MR (Mahout) #11 Unstructured HiveQL #26 Structured Java MR (Mahout) #12 Semi-Structured HiveQL #27 Unstructured Java MR (OpenNLP) #13 Structured HiveQL #28 Unstructured Java MR (Mahout) #14 Structured HiveQL #29 Structured Python Streaming MR #15 Structured Java MR (Mahout) #30 Semi-Structured Python Streaming MR 12

SQL-MR Query 1 SELECT category_cd1 AS category1_cd, category_cd2 AS category2_cd, COUNT (*) AS cnt FROM basket_generator ( ON ( SELECT i.i_category_id AS category_cd, s.ws_bill_customer_sk AS customer_id FROM web_sales s INNER JOIN item i ON s.ws_item_sk = i.item_sk ) PARTITION BY customer_id BASKET_ITEM ( category_cd') ITEM_SET_MAX (500) ) GROUP BY 1,2 ORDER BY 1, 3, 2; 13

HiveQL Query 1 SELECT pid1, pid2, COUNT (*) AS cnt FROM ( FROM ( FROM ( SELECT s.ss_ticket_number AS oid, s.ss_item_sk AS pid FROM store_sales s INNER JOIN item i ON s.ss_item_sk = i.i_item_sk WHERE i.i_category_id in (1,2,3) and s.ss_store_sk in (10, 20, 33, 40, 50) ) q01_temp_join MAP q01_temp_join.oid, q01_temp_join.pid USING 'cat' AS oid, pid CLUSTER BY oid ) q01_map_output REDUCE q01_map_output.oid, q01_map_output.pid USING 'java -cp bigbenchqueriesmr.jar:hive-contrib.jar de.bankmark.bigbench.queries.q01.red' AS (pid1 BIGINT, pid2 BIGINT) ) q01_temp_basket GROUP BY pid1, pid2 HAVING COUNT (pid1) > 49 ORDER BY pid1, cnt, pid2; 14

Benchmark Process Adapted to batch systems No trickle update Measured processes Loading Power Test (single user run) Throughput Test I (multi user run) Data Maintenance Throughput Test II (multi user run) Result Additive Metric Data Generation Loading Power Test Throughput Test I Data Maintenance Throughput Test II Result 15

Metric Throughput metric BigBench queries per hour Number of queries run 30*(2*S+1) Measured times Metric Source: www.wikipedia.de 16

DG LOAD q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20 q21 q22 q23 q24 q25 q26 q27 q28 q29 q30 BigBench Experiments Tests on Cloudera CDH 5.0, Pivotal GPHD-3.0.1.0, IBM InfoSphere BigInsights In progress: Spark, Impala, Stinger, 3 Clusters (+) 1 node: 2x Xeon E5-2450 0 @ 2.10GHz, 64GB RAM, 2 x 2TB HDD 6 nodes: 2 x Xeon E5-2680 v2 @2.80GHz, 128GB RAM, 12 x 2TB HDD 546 nodes: 2 x Xeon X5670 @2.93GHz, 48GB RAM, 12 x 2TB HDD Q16 1:26:24 0:57:36 0:28:48 0:00:00 Q01 17

BigBench reference implementation - 2014 Hadoop Map-Reduce and Hive Hadoop Map-Reduce 2.0 HIVE, Mahout Java 1.7 Reference Kit Queries All 30 queries are implemented. Represents Structured, Semi-Structrured, Un-Structured data types. Complete runnable kit Data generator, queries, benchmark driver Tested on various Hadoop implementation Easy to configure and run, detailed setup instructions https://github.com/intel-hadoop/big-bench Bring some time Full BigBench run on Hive takes 2 days+ Will verify if your cluster is setup correctly 18

BigBench 2015+ - Towards a Big Data Pipeline Extensive Procedural Coverage Large input Datasets Web-based procedural Functional component level Machine Learning Dedicated machine learning workloads Algorithm coverage Segmented Others Key-value pair representation Cover graph Streaming applications Multimedia 19

BigBench 2015 Metrics Metric Current metric based on geometric mean Represent performance / $ - scaling factor Segmented Structural Unstructured/procedural Machine learning Other components All components = E2E Machine learning accuracy Fault tolerant performance 20

Contact tilmann.rabl@bankmark.de www.bankmark.de 21