Setting the Direction for Big Data Benchmark Standards

Chaitan Baru, Center for Large-scale Data Systems Research (CLDS), San Diego Supercomputer Center, UC San Diego
Milind Bhandarkar, Greenplum
Raghunath Nambiar, Cisco
Meikel Poess, Oracle
Tilmann Rabl, University of Toronto
Characterizing the Application Environment
- Enormous data sizes (volume), high data rates (velocity), and a variety of data genres
- Datacenter issues: large-scale and evolving system configurations, shifting loads, heterogeneous technologies
- Data genres: structured, unstructured, graphs, streams, images, scientific data, etc.
- Software options: SQL, NoSQL, the Hadoop ecosystem
- Hardware options: HDD vs. SSD (NVM); different types of HDD, NVM, and main memory; large-memory systems; etc.
- Platform options: dedicated commodity clusters vs. shared cloud platforms
Industry's First Workshop on Big Data Benchmarking
Acknowledgements:
- National Science Foundation (Grant #IIS-1241838)
- SDSC industry sponsors: Seagate, Greenplum, NetApp, Mellanox, Brocade (event host)
- WBDB2012 steering committee: Chaitan Baru, SDSC/UC San Diego; Milind Bhandarkar, Greenplum/EMC; Raghunath Nambiar, Cisco; Meikel Poess, Oracle; Tilmann Rabl, University of Toronto
Invited Attendee Organizations
Actian, AMD, BMMsoft, Brocade, CA Labs, Cisco, Cloudera, Convey Computer, CWI/Monet, Dell, EPFL, Facebook, Google, Greenplum, Hewlett-Packard, Hortonworks, Indiana Univ./HathiTrust Research Foundation, InfoSizing, Intel, LinkedIn, MapR/Mahout, Mellanox, Microsoft, NSF, NetApp, NetApp/OpenSFS, Oracle, Red Hat, San Diego Supercomputer Center, SAS, Scripps Research Institute, Seagate, Shell, SNIA, Teradata Corporation, Twitter, UC Irvine, Univ. of Minnesota, Univ. of Toronto, Univ. of Washington, VMware, WhamCloud, Yahoo!

Meeting structure: 6 x 15-minute invited presentations; 35 x 5-minute lightning talks; structured discussion sessions in the afternoons.
Topics of Discussion
- Audience: Who is the audience for this benchmark?
- Application: What application should we model?
- Single benchmark spec: Is it possible to develop a single benchmark that captures the characteristics of multiple applications?
- Component vs. end-to-end benchmark: Is it possible to factor out a set of benchmark components that can be isolated and plugged into an end-to-end benchmark?
- Paper-and-pencil vs. implementation-based: Should the benchmark be specification-driven or implementation-driven?
- Reuse: Can we reuse existing benchmarks?
- Benchmark data: Where do we get the data from?
- Innovation or competition: Should the benchmark be for innovation or for competition?
Audience: Who Is the Primary Audience for a Big Data Benchmark?
- Customers
  - The workload should preferably be expressed in English, or in a declarative language (for the unsophisticated user), but not in a procedural language (for the sophisticated user)
  - Want to compare among different vendors
- Vendors
  - Would like to sell machines/systems based on benchmark results
- Computer science and hardware research is also an audience
  - Niche players and technologies will emerge out of academia
  - The benchmark will be useful for training students in benchmarking
Applications: What Application Should We Model?
- An application that somebody could donate?
- An application based on empirical data, e.g. examples from scientific applications?
- A multi-channel retailer application, like TPC-DS amended for big data? A mature schema, large-scale data generator, execution rules, and audit process already exist.
- An abstraction of an Internet-scale application, e.g. data management at Facebook, with synthetic data?
Single Benchmark vs. Multiple Benchmarks
- Is it possible to develop a single benchmark that represents multiple applications?
- Yes, but it is not desirable if there is no synergy between the applications, e.g. at the data model level
- A synthetic Facebook-like application might provide the context for a single benchmark: click streams, data sorting/indexing, weblog processing, graph traversals, image/video data, etc. (see the sketch below)
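As an illustration of what one such workload component might look like, here is a minimal, hypothetical sketch of a click-stream sessionization step over synthetic weblog records. It is not part of any proposed specification; the record format, field names, and 30-minute session gap are assumptions made for the example.

# Hypothetical sketch: sessionize a synthetic click stream, one candidate
# component of an end-to-end "Internet-scale application" workload.
# Record format, field names, and the 30-minute gap are illustrative assumptions.
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed session timeout

def sessionize(events):
    """events: iterable of (user_id, timestamp_seconds, url) tuples."""
    by_user = defaultdict(list)
    for user_id, ts, url in events:
        by_user[user_id].append((ts, url))

    session_counts = {}
    for user_id, clicks in by_user.items():
        clicks.sort()  # order each user's clicks by time
        sessions = 1
        for (prev_ts, _), (ts, _) in zip(clicks, clicks[1:]):
            if ts - prev_ts > SESSION_GAP_SECONDS:
                sessions += 1
        session_counts[user_id] = sessions
    return session_counts

if __name__ == "__main__":
    sample = [("u1", 0, "/home"), ("u1", 100, "/item/42"),
              ("u1", 4000, "/home"), ("u2", 50, "/search")]
    print(sessionize(sample))  # {'u1': 2, 'u2': 1}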
Component Benchmark vs. End-to-End Benchmark
- Are there components that can be isolated and plugged into an end-to-end benchmark?
- The benchmark should consist of individual components that together make up an end-to-end benchmark (see the sketch below)
- The benchmark should include a component that extracts large data
  - Many data science applications extract large datasets and then visualize the output
  - This is an opportunity for pushing visualization down into the data management system
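To make the idea of pluggable components concrete, here is a minimal sketch of components with a common interface that can be timed in isolation or chained into an end-to-end run. The interface, stage names, and placeholder workloads are assumptions for illustration, not a proposed API.

# Hypothetical sketch: benchmark components with a common interface that can be
# run and timed individually or composed into an end-to-end benchmark.
import time

class Component:
    name = "component"
    def run(self, data):
        raise NotImplementedError

class Generate(Component):
    name = "generate"
    def run(self, data):
        return list(range(1_000_000))  # stand-in for a real data generator

class Sort(Component):
    name = "sort"
    def run(self, data):
        return sorted(data, reverse=True)

class Extract(Component):
    name = "extract"
    def run(self, data):
        return data[:10]  # stand-in for extracting a result set for visualization

def run_end_to_end(components):
    data, timings = None, {}
    for c in components:
        start = time.perf_counter()
        data = c.run(data)
        timings[c.name] = time.perf_counter() - start
    return data, timings

if __name__ == "__main__":
    result, timings = run_end_to_end([Generate(), Sort(), Extract()])
    print(result[:3], timings)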
Paper-and-Pencil (Specification-Driven) vs. Implementation-Driven
- Start with an implementation and develop the specification at the same time
- Some post-workshop activity has already begun in this area: data generation, sorting, and some processing
Where Do We Get the Data From?
- Downloading data is not an option; data needs to be generated (quickly); see the sketch below
- Examples of actual datasets from scientific applications: observational data (e.g. LSST), simulation outputs
- Existing data generators (TPC-DS, TPC-H) can be used
- Data that is generic enough and has good characteristics is better than application-specific data
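As a sketch of the "generate quickly and reproducibly" requirement, the following hypothetical example produces synthetic records in independently seeded chunks so they can be generated in parallel and reproduced exactly. The record schema, chunk size, and worker count are assumptions for illustration.

# Hypothetical sketch: deterministic, chunked data generation. Each chunk is
# seeded independently, so chunks can be generated in parallel across workers
# (or nodes) and still reproduce identical data. Schema and sizes are illustrative.
import random
from multiprocessing import Pool

ROWS_PER_CHUNK = 100_000

def generate_chunk(chunk_id):
    rng = random.Random(chunk_id)  # per-chunk seed => reproducible, parallelizable
    rows = []
    for i in range(ROWS_PER_CHUNK):
        key = chunk_id * ROWS_PER_CHUNK + i
        value = rng.randrange(10**9)
        category = rng.choice(["web", "store", "catalog"])
        rows.append((key, value, category))
    return rows

def generate(num_chunks, workers=4):
    with Pool(workers) as pool:
        for chunk in pool.imap(generate_chunk, range(num_chunks)):
            yield from chunk

if __name__ == "__main__":
    rows = list(generate(2))
    print(len(rows), rows[0])  # same output on every run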
Should the Benchmark Be for Innovation or Competition?
- Innovation and competition are not mutually exclusive; the benchmark should be used for both
- The benchmark should be designed for competition; such a benchmark will then also be used internally for innovation
- TPC-H is a prime example of a benchmark model that can drive both competition and innovation (if combined correctly)
Can We Reuse Existing Benchmarks?
- Yes, we could, but we need to discuss:
  - How much augmentation is necessary?
  - Can the benchmark data be scaled?
  - If the benchmark uses SQL, we should not require it
- Examples (none of the following could be used unmodified):
  - Statistical Workload Injector for MapReduce (SWIM)
  - GridMix3 (lots of shortcomings)
  - Open-source TPC-DS
  - YCSB++ (lots of shortcomings)
  - TeraSort: strong sentiment for using this, perhaps as part of an end-to-end scenario (see the sketch below)
- From the lightning talk on TPC-DS as a big data benchmark data model (Ahmad Ghazal, Aster): TPC-DS is the decision support benchmark from the Transaction Processing Performance Council (http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf). Why build on top of TPC-DS? Volume: no theoretical limit, tested up to 100 TB. Velocity: rolling updates. Variety: easy to add the other two sources.
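As a concrete illustration of the TeraSort component, here is a minimal in-memory sketch that generates and sorts records in the standard sort-benchmark layout (100-byte records, 10-byte key). It is not the Hadoop TeraGen/TeraSort implementation; the zero-filled payload and record counts are simplifying assumptions.

# Hypothetical sketch: generate and sort TeraSort-style records in memory.
# Real runs use distributed implementations (e.g. Hadoop TeraGen/TeraSort);
# this only illustrates the record layout and the sort step.
import os

RECORD_SIZE = 100  # bytes per record
KEY_SIZE = 10      # leading bytes used as the sort key

def generate_records(num_records):
    return [os.urandom(KEY_SIZE) + bytes(RECORD_SIZE - KEY_SIZE)
            for _ in range(num_records)]

def terasort(records):
    return sorted(records, key=lambda r: r[:KEY_SIZE])

if __name__ == "__main__":
    recs = generate_records(1_000)
    out = terasort(recs)
    assert all(a[:KEY_SIZE] <= b[:KEY_SIZE] for a, b in zip(out, out[1:]))
    print(f"sorted {len(out)} records of {RECORD_SIZE} bytes each")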
Keep in Mind Principles of Good Benchmark Design
- Self-scaling, e.g. TPC-C
- Comparability between scale factors: results should be comparable at different scales (see the sketch below)
- Technology agnostic (if meaningful to the application)
- Simple to run

Lessons from TPC (Reza Taheri, VMware):
+ Longevity: TPC-C has carried the load for 20 years
+ Comparability: audit requirements and strict, detailed run rules mean one can compare results published by two different entities
+ Scaling: results are just as meaningful at the high end of the market as at the low end, and as relevant on clusters as on single servers
- Hard and expensive to run
- No kit
- DeWitt clauses
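To illustrate self-scaling and comparability between scale factors, here is a toy sketch in which data volume and the number of concurrent query streams are both derived from a single scale factor, and the reported metric is always labeled with that scale factor. The formulas and names are invented for illustration and are not drawn from any TPC specification.

# Toy sketch of a self-scaling workload definition: everything is derived from
# one scale factor, and the metric is always reported together with it.
# The formulas below are illustrative assumptions, not TPC rules.
from dataclasses import dataclass

@dataclass
class WorkloadConfig:
    scale_factor: int   # e.g. SF 1 ~ 1 GB of raw data (assumption)
    raw_data_gb: int
    query_streams: int

def configure(scale_factor):
    return WorkloadConfig(
        scale_factor=scale_factor,
        raw_data_gb=scale_factor,                   # data volume grows with SF
        query_streams=max(1, scale_factor // 100),  # concurrency grows with SF
    )

def report(queries_completed, elapsed_hours, scale_factor):
    qph = queries_completed / elapsed_hours
    # Label the metric with its scale factor: results at different scales are
    # published side by side, never folded into a single number.
    return f"{qph:.1f} queries/hour @ SF{scale_factor}"

if __name__ == "__main__":
    cfg = configure(1000)
    print(cfg)
    print(report(queries_completed=8600, elapsed_hours=2.0,
                 scale_factor=cfg.scale_factor))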
Extrapolating Results
- TPC benchmarks typically run on over-specified systems, i.e. customer installations may have less hardware than the benchmark installation (the system under test, SUT)
- Big data benchmarking may be the opposite: we may need to run the benchmark on systems that are smaller than customer installations
- Can we extrapolate? Scaling may be piecewise linear, so we need to find the inflection points (see the sketch below)
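As a sketch of how one might look for such inflection points, the following example computes the incremental slope of measured runtime versus data size and flags points where the slope changes sharply. The measurements and the threshold are invented for illustration.

# Hypothetical sketch: detect inflection points in benchmark scaling data by
# looking for large changes in incremental slope. Data and threshold invented.

def slopes(points):
    """points: list of (size, runtime) pairs sorted by size."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

def inflection_points(points, ratio_threshold=1.5):
    s = slopes(points)
    flagged = []
    for i in range(1, len(s)):
        if s[i - 1] > 0 and s[i] / s[i - 1] > ratio_threshold:
            flagged.append(points[i])  # size where scaling behavior changes
    return flagged

if __name__ == "__main__":
    # (data size in TB, elapsed hours): roughly linear up to 8 TB, then steeper
    measured = [(1, 1.0), (2, 2.1), (4, 4.0), (8, 8.2), (16, 25.0), (32, 52.0)]
    print(inflection_points(measured))  # expect the 8 TB point to be flagged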
Interest in Results from the Workshop
- June 9, 2012: "Summary of Workshop on Big Data Benchmarking," C. Baru, Workshop on Architectures and Systems for Big Data, Portland, OR
- June 22-23, 2012: "Towards Industry Standard Benchmarks for Big Data," M. Bhandarkar, The Extremely Large Databases Conference in Asia
- August 27-31, 2012: "Setting the Direction for Big Data Benchmark Standards," Baru, Bhandarkar, Nambiar, Poess, Rabl, 4th Int'l TPC Technology Conference (TPCTC 2012), with VLDB 2012, Istanbul, Turkey
- September 10-13, 2012: "Introducing the Big Data Benchmarking Community," lightning talk and poster at the 6th Extremely Large Databases Conference (XLDB), Stanford Univ., Palo Alto
- September 20, 2012: SNIA Analytics and Big Data Summit, Santa Clara, CA
- September 21, 2012: Workshop on Managing Big Data Systems, San Jose, CA
Big Data Benchmarking Community (BDBC)
- Biweekly phone conferences: group communication and coordination is facilitated via phone calls every two weeks
- See http://clds.sdsc.edu/bdbc/community for call-in information
- Contact tilmann.rabl@utoronto.ca or baru@sdsc.edu to join the BDBC mailing list
Big Data Benchmarking Community Participants
Actian, AMD, Argonne National Labs, BMMsoft, Brocade, CA Labs, Cisco, Cloudera, CMU, Convey Computer, CWI/Monet, DataStax, Dell, EPFL, Facebook, GaTech, Google, Greenplum, Hewlett-Packard, Hortonworks, IBM, Indiana U./HTRF, InfoSizing, Intel, Johns Hopkins U., LinkedIn, MapR/Mahout, Mellanox, Microsoft, NetApp, Netflix, NIST, NSF, Ohio State U., OpenSFS, Oracle, PayPal, PNNL, Red Hat, San Diego Supercomputer Center, SAS, Seagate, Shell, SLAC, SNIA, Teradata, Twitter, UC Berkeley, UC Irvine, UC San Diego, Univ. of Minnesota, Univ. of Passau, Univ. of Toronto, Univ. of Washington, VMware, WhamCloud, Yahoo!
2nd Workshop on Big Data Benchmarking: India
- December 17-18, 2012, Pune, India
- Hosted by Persistent Systems, India
- See http://clds.ucsd.edu/wbdb2012.in
- Themes: application scenarios/use cases, benchmark process, benchmark metrics, data generation
- Format: invited speakers, lightning talks, demos
- We welcome sponsorship. Current sponsors: NSF, Persistent Systems, Seagate, Greenplum, Brocade, Mellanox
3rd Workshop on Big Data Benchmarking: China
- June 2013, Xi'an, China
- Hosted by the Shanxi Supercomputing Center
- See http://clds.ucsd.edu/wbdb2013.cn
- Current sponsors: IBM, Seagate, Greenplum, Mellanox, Brocade?
The Center for Large-scale Data Systems Research (CLDS)
- At the San Diego Supercomputer Center, UC San Diego
- An R&D center and forum for discussions on big data-related topics: benchmarking; a project on data dynamics; information value; a focus on industry segments, e.g. healthcare
- Center structure: workshops; program-level activities; project-level activities
- For sponsorship opportunities, contact Chaitan Baru, SDSC, baru@sdsc.edu

Thank you!