4th Workshop on Big Data Benchmarking


4th WBDB: Welcome and Introduction
Chaitan Baru
Associate Director, Data Initiatives, San Diego Supercomputer Center
Director, Center for Large-scale Data Systems Research
University of California San Diego

Thanks!
- Brocade: providing the venue and catering
  - Sheri Mukai; Michele Limbocker; Suresh Vobillisetty
- CLDS sponsors: Pivotal, Intel, NetApp, Seagate
- CLDS Organizing Committee
- Speakers/attendees
- Springer-Verlag

CLDS: Center for Large-scale Data Systems Research
- R&D activity within San Diego Supercomputer Center
- Current projects/activities
  - Big Data Benchmarking
  - Opportunity to work with CS graduate students
  - Data Value
  - How Much Information
- CSE Master of Advanced Studies (MAS) in Big Data Science
- SDSC Data Science Institute
  - Initiative focused on onsite education and training in Data Science for industry

SDSC
- A national and UC-based center for high-performance computing and data-intensive computing (big data)
- Established >25 years ago
- Engaged in Research + Development + Production (RDP)
- Offers datacenter services to UC, as well as non-UC and industry partners

Comet: System Characteristics
- Planned for Jan 2015
- Total flops ~1.8-2.0 PF
- Dell primary integrator
  - Intel processors
  - Mellanox InfiniBand
  - Aeon storage vendor
- Standard compute nodes
  - Intel next-gen processors
  - 128 GB DRAM
  - 320 GB SSD
- Large-memory nodes
  - 1.5 TB DRAM
- GPU nodes
- Hybrid fat-tree topology
  - FDR InfiniBand
  - Rack-level full bisection bandwidth (72 nodes)
  - 4:1 oversubscription cross-rack
- Performance Storage
  - 7 PB, 200 GB/s
  - Scratch & Persistent Storage
- Durable Storage (reliability)
  - 6 PB disk
- Gateway hosting nodes and VM image repository
- 100 Gbps external connectivity

WBDB Background
- Genesis of this effort
  - NSF Cluster Exploratory (CluE) research project
  - "On Performance Evaluation of On-Demand Provisioning of Data Intensive Applications" (2009-2012)
  - Led to a study of benchmarks to compare Hadoop and relational DBMS
- Launched Workshops on Big Data Benchmarking
  - Funded by NSF and industry sponsorships
  - 1st WBDB: May 2012, San Jose. Hosted by Brocade
  - 2nd WBDB: December 2012, Pune, India. Hosted by Persistent Systems / Infosys
  - 3rd WBDB: July 2013, Xi'an, China. Hosted by Xi'an University
  - ~130 attendees (including duplicates) + ~40 today

1st WBDB Attendee Organizations
Actian, AMD, BMMsoft, Brocade, CA Labs, Cisco, Cloudera, Convey Computer, CWI/Monet, Dell, EPFL, Facebook, Google, Greenplum, Hewlett-Packard, Hortonworks, Indiana Univ / Hathitrust Research Foundation, InfoSizing, Intel, LinkedIn, MapR/Mahout, Mellanox, Microsoft, NSF, NetApp, NetApp/OpenSFS, Oracle, San Diego Supercomputer Center, SAS, Scripps Research Institute, Seagate, Shell, SNIA, Teradata Corporation, Twitter, UC Irvine, Univ. of Minnesota, Univ. of Toronto, Univ. of Washington, VMware, WhamCloud, Yahoo!, Red Hat

4th WBDB: http://clds.sdsc.edu/wbdb2013.us
3rd WBDB: http://clds.sdsc.edu/wbdb2013.cn

WBDB Outcomes
- Big Data Benchmarking Community (BDBC) mailing list (~160 members from ~75 organizations)
  - (Remote) talks every other Thursday at 9 AM US Pacific time
- Selected papers to be published in Springer-Verlag LNCS: 2012 and 2013
- Issues paper from the First Workshop: "Setting the Direction for Big Data Benchmark Standards" by C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl, published in Selected Topics in Performance Evaluation and Benchmarking, Springer-Verlag
- Article in the inaugural issue of Big Data Journal: "Big Data Benchmarking and the Big Data Top100 List" by Baru, Bhandarkar, Nambiar, Poess, Rabl, Big Data Journal, Vol. 1, No. 1, 60-64, Mary Ann Liebert Publications
- Formation of the TPC-BD Subcommittee on Big Data benchmarking

Current Status: Issues Discussed at the Workshops
- Different types of benchmarks for different aspects of a system
- Micro-benchmarks: specific lower-level system operations
  - I/O operations, e.g. "A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters", Panda et al., OSU
- Functional benchmarks
  - Terasort
  - Basic SQL: individual SQL operations, e.g. Select, Project, Join, Order-By
- Genre-specific benchmarks
  - E.g. Graph500
- Application-level benchmarks
  - Measure system-level performance of hardware and software, for a given dataset and workload (a given application scenario)
  - E.g. TPC benchmarks: TPC-C, TPC-H, TPC-DS
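To make the distinction between benchmark types concrete, here is a minimal sketch of a functional benchmark harness in Python. It is an illustration only: the function names, the record layout, and the "best of N runs" policy are assumptions, and an in-memory sort merely stands in for a Terasort-style operation.

```python
import random
import time


def run_benchmark(operation, payload, repeats=3):
    """Time operation(payload) several times; return the best wall-clock seconds."""
    best = float("inf")
    for _ in range(repeats):
        data = list(payload)  # fresh copy so every run does the same work
        start = time.perf_counter()
        operation(data)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def sort_records(records):
    """Stand-in functional operation: sort records in place by key."""
    records.sort(key=lambda rec: rec[0])


# Hypothetical input: 100,000 (key, payload) records from a seeded generator.
rng = random.Random(42)
records = [(rng.getrandbits(32), "payload") for _ in range(100_000)]

seconds = run_benchmark(sort_records, records)
print(f"sorted {len(records)} records, best of 3 runs: {seconds:.4f} s")
```

A micro-benchmark would apply the same harness to a single low-level operation (for example one file write), while an application-level benchmark would time a whole workload end to end.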

Benchmark Design Issues
- Audience: Who is the audience for such a benchmark?
  - Marketing (customers / end users), internal use (engineering), academic use
- Application: What is the application that should be modeled?
  - Abstractions of a data pipeline, e.g. Internet-scale business
- Should the benchmark be for innovation or competition?
  - Successful competitive benchmarks will be used for innovation

Design Issues - 2
- Single benchmark specification: Is it possible to develop a single benchmark to capture characteristics of multiple applications?
  - Single, multi-step benchmark, with plausible end-to-end scenario
- Component vs. end-to-end benchmark: Is it possible to factor out a set of benchmark components, which can be isolated and plugged into an end-to-end benchmark?
  - The benchmark should consist of individual components that ultimately make up an end-to-end benchmark

Design Issues - 3
- Paper and pencil vs. implementation-based: Should the implementation be specification-driven or implementation-driven?
  - Start with an implementation and develop the specification at the same time
- Reuse: Can we reuse existing benchmarks?
  - Leverage existing work and built-up knowledge base
- Benchmark data: Where do we get the data from?
  - Synthetic data generation: structured, non-structured data
- Verifiability: Should there be a process for verification of results?
  - YES!
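As a rough illustration of synthetic data generation that also supports verifiability, the sketch below produces structured rows from a fixed seed, so that anyone re-running it gets identical data. The table layout, the function name `generate_sales`, and the 1/rank popularity weighting are assumptions made for this example, not part of any benchmark specification.

```python
import random


def generate_sales(num_rows, num_items, seed):
    """Return (row_id, item_id, quantity) rows with skewed item popularity."""
    rng = random.Random(seed)  # fixed seed makes the data reproducible
    item_ids = list(range(1, num_items + 1))
    # Assumed skew: item with rank r is chosen with weight 1/r (Zipf-like).
    weights = [1.0 / rank for rank in item_ids]
    items = rng.choices(item_ids, weights=weights, k=num_rows)
    return [(row_id, item, rng.randint(1, 5)) for row_id, item in enumerate(items)]


rows = generate_sales(num_rows=1000, num_items=50, seed=7)
print(rows[:3])
```

Because the generator is seeded, a verifier can regenerate the exact dataset from `(num_rows, num_items, seed)` instead of shipping the data itself.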

Abstractions of the Big Data World from WBDB
- Enterprise Warehouse + Agglomeration of other data
  - Structured enterprise data warehouse
  - Extended to incorporate data from other non-fully structured data sources (e.g. weblogs, text, streams)
- Pool of data with sequence of processing
  - Enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, predictive analytics
  - Data from multiple structured and non-structured sources

Proposal 1: BigBench
- Ghazal et al.: Teradata, Oracle, U. of Toronto, InfoSizing
- Derived from TPC-Decision Support (TPC-DS)
  - Multiple snowflake schemas with shared dimensions
  - 24 tables with an average of 18 columns
  - 99 distinct SQL-99 queries with random substitutions
  - More representative skewed database content
  - Sub-linear scaling of non-fact tables
  - Ad-hoc, reporting, iterative and extraction queries
  - ETL-like data maintenance
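The phrase "queries with random substitutions" can be pictured as query templates whose parameters are filled in from a seeded random generator. The following Python sketch shows the idea only; the template text, table and column names, and value lists are hypothetical and are not taken from TPC-DS or BigBench.

```python
import random

# Hypothetical query template; {year} and {category} are substitution points.
TEMPLATE = (
    "SELECT category, SUM(quantity) FROM sales "
    "WHERE sale_year = {year} AND category = '{category}' "
    "GROUP BY category"
)

# Hypothetical substitution domains.
YEARS = [2010, 2011, 2012, 2013]
CATEGORIES = ["Books", "Music", "Shoes"]


def instantiate(template, seed):
    """Fill the template with values drawn from a seeded generator."""
    rng = random.Random(seed)
    return template.format(year=rng.choice(YEARS), category=rng.choice(CATEGORIES))


query = instantiate(TEMPLATE, seed=1)
print(query)
```

Seeding the substitution keeps runs comparable: two systems given the same seed execute the same concrete query, while different seeds prevent tuning for one fixed set of parameter values.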

BigBench Data Model
- Workload = set of queries
  - On structured, semi-structured, unstructured data
  - Data mining, ML
- Paper published in ACM SIGMOD 2013. Full specification to appear in WBDB2012 publication

Proposal 2: Deep Analytics Pipeline
- An end-to-end data processing pipeline:
  - Data from multiple sources
  - Loose, flexible schema
  - Data requires structuring
  - ELT rather than ETL
- Application characteristics
  - Processing pipelines
  - Running models with data
- Pipeline stages: Acquisition/Recording -> Extraction/Cleaning/Annotation -> Integration/Aggregation/Representation -> Analysis/Modeling -> Interpretation

Example of an Application: User Modeling
- Objective: Determine user interests by mining user activities
- Large dimensionality of possible user activities
- Typical user has sparse activity vector
- Event attributes change over time
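A sparse activity vector can be sketched as a mapping that stores only the activities a user actually performed. The example below is a minimal illustration in Python; the event strings and the size of the activity space are made-up values.

```python
from collections import Counter

# Hypothetical size of the space of possible user activities.
NUM_POSSIBLE_ACTIVITIES = 1_000_000

# Hypothetical activity log for one user.
events = ["search:shoes", "click:ad_17", "search:shoes"]

# Sparse vector: only observed activities are stored; all others are implicitly 0.
activity_vector = Counter(events)

density = len(activity_vector) / NUM_POSSIBLE_ACTIVITIES
print(activity_vector["search:shoes"], activity_vector["view:never_seen"], density)
```

Storing only non-zero entries keeps per-user memory proportional to observed activity rather than to the full dimensionality.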

User Modeling Pipeline
- Data Acquisition
- Sessionization
- Feature and Target Generation
- Model Training
- Offline Scoring & Evaluation
- Batch Scoring & Upload to serving
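The sessionization step can be sketched as follows, assuming events are (user_id, timestamp-in-seconds) pairs and that a session ends after 30 minutes of inactivity. Both the event layout and the threshold are assumptions chosen for illustration; the slides do not define them.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user lists of sessions."""
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])  # gap exceeded: start a new session
            else:
                user_sessions[-1].append(ts)
        sessions[user_id] = user_sessions
    return sessions


events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
result = sessionize(events)
print(result)
# u1 has two sessions ([0, 600] and [4000]); u2 has one ([100]).
```

The per-user sessions would then feed the feature and target generation step.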

Next Steps
- TPC-BD subcommittee
  - Join TPC if you want to influence that process
- BigData Top100 List
  - An open, community effort to rank systems by performance (with price/performance) on Big Data workloads
  - "HPC meets enterprise": combine ideas from TPC and Top500
  - TPC has influenced design and efficiency of DBMSs over 25 years
  - Borrow ranking concept from Top500
  - But include price/performance and green metrics
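One way to picture a ranking that combines raw performance with price/performance and a green metric is sketched below. The metric definitions (price divided by throughput, throughput per kilowatt), the field names, and all numbers are invented for illustration; the BigData Top100 List's actual metrics are not specified in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    system: str
    throughput: float  # hypothetical workload units per hour
    price_usd: float   # hypothetical total system price
    power_kw: float    # hypothetical average power draw

    @property
    def price_performance(self):
        """Dollars per unit of throughput (lower is better)."""
        return self.price_usd / self.throughput

    @property
    def throughput_per_kw(self):
        """Throughput per kilowatt (higher is better)."""
        return self.throughput / self.power_kw


# Made-up results for three hypothetical systems.
results = [
    Result("A", 1000.0, 500_000.0, 40.0),
    Result("B", 800.0, 200_000.0, 20.0),
    Result("C", 1200.0, 900_000.0, 60.0),
]

by_performance = sorted(results, key=lambda r: r.throughput, reverse=True)
by_price_performance = sorted(results, key=lambda r: r.price_performance)
by_efficiency = sorted(results, key=lambda r: r.throughput_per_kw, reverse=True)

print([r.system for r in by_performance])        # ['C', 'A', 'B']
print([r.system for r in by_price_performance])  # ['B', 'A', 'C']
print([r.system for r in by_efficiency])         # ['B', 'A', 'C']
```

In this toy data the fastest system ranks last on both price/performance and efficiency, which is the kind of trade-off a list with several metrics would expose.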

Next Steps: BigData Community Challenges
- Challenges related to the Deep Analytics Pipeline
  - Definition of each step
  - Ideas for machine learning and predictive analytics steps
  - Ideas for metrics: performance and price/performance
- Announce competitions via Kaggle and other venues

5th WBDB
- Would like to host it in Europe (Germany?) around Summer 2014
- Looking for interested hosts, sponsors, local organizers