4th Workshop on Big Data Benchmarking

4th WBDB: Welcome and Introduction
  Chaitan Baru
  Associate Director, Data Initiatives, San Diego Supercomputer Center
  Director, Center for Large-scale Data Systems Research
  University of California San Diego

Thanks!
  Brocade: providing the venue and catering
    Sheri Mukai; Michele Limbocker; Suresh Vobillisetty
  CLDS sponsors: Pivotal, Intel, NetApp, Seagate
  CLDS Organizing Committee
  Speakers/attendees
  Springer-Verlag

CLDS: Center for Large-scale Data Systems Research
  R&D activity within San Diego Supercomputer Center
  Current projects/activities
    Big Data Benchmarking
    Opportunity to work with CS graduate students
    Data Value
    How Much Information
    CSE Master of Advanced Studies (MAS) in Big Data Science
    SDSC Data Science Institute
      Initiative focused on onsite education and training in Data Science for industry

SDSC
  A national and UC-based center for high-performance computing and data-intensive computing (big data)
  Established >25 years ago
  Engaged in Research + Development + Production (RDP)
  Offers datacenter services to UC, as well as to non-UC and industry partners

Comet: System Characteristics
  Planned for Jan 2015
  Total flops ~1.8-2.0 PF
  Dell primary integrator
    Intel processors
    Mellanox InfiniBand
    Aeon storage vendor
  Standard compute nodes
    Intel next-gen processors
    128 GB DRAM
    320 GB SSD
  Large-memory nodes
    1.5 TB DRAM
  GPU nodes
  Hybrid fat-tree topology
    FDR InfiniBand
    Rack-level full bisection bandwidth (72 nodes)
    4:1 oversubscription cross-rack
  Performance Storage
    7 PB, 200 GB/s
    Scratch & Persistent Storage
  Durable Storage (reliability)
    6 PB disk
  Gateway hosting nodes and VM image repository
  100 Gbps external connectivity

WBDB Background
  Genesis of this effort
    NSF Cluster Exploratory (CluE) research project "On Performance Evaluation of On-Demand Provisioning of Data Intensive Applications" (2009-2012)
    Led to a study of benchmarks to compare Hadoop and relational DBMS
  Launched Workshops on Big Data Benchmarking
    Funded by NSF and industry sponsorships
    1st WBDB: May 2012, San Jose. Hosted by Brocade
    2nd WBDB: December 2012, Pune, India. Hosted by Persistent Systems / Infosys
    3rd WBDB: July 2013, Xi'an, China. Hosted by Xi'an University
    ~130 attendees (including duplicates) + ~40 today

1st WBDB Attendee Organizations
  Actian, AMD, BMMsoft, Brocade, CA Labs, Cisco, Cloudera, Convey Computer, CWI/Monet, Dell, EPFL, Facebook, Google, Greenplum, Hewlett-Packard, Hortonworks, Indiana Univ / Hathitrust Research Foundation, InfoSizing, Intel, LinkedIn, MapR/Mahout, Mellanox, Microsoft, NSF, NetApp, NetApp/OpenSFS, Oracle, San Diego Supercomputer Center, SAS, Scripps Research Institute, Seagate, Shell, SNIA, Teradata Corporation, Twitter, UC Irvine, Univ. of Minnesota, Univ. of Toronto, Univ. of Washington, VMware, WhamCloud, Yahoo!, Red Hat

4th WBDB: http://clds.sdsc.edu/wbdb2013.us
3rd WBDB: http://clds.sdsc.edu/wbdb2013.cn

WBDB Outcomes
  Big Data Benchmarking Community (BDBC) mailing list (~160 members from ~75 organizations)
    (Remote) talks every other Thursday at 9 AM US Pacific time
  Selected papers to be published in Springer-Verlag LNCS: 2012 and 2013 issues
  Paper from the First Workshop
    "Setting the Direction for Big Data Benchmark Standards" by C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl, published in Selected Topics in Performance Evaluation and Benchmarking, Springer-Verlag
  Article in the inaugural issue of the Big Data Journal
    "Big Data Benchmarking and the Big Data Top100 List" by Baru, Bhandarkar, Nambiar, Poess, Rabl, Big Data Journal, Vol. 1, No. 1, 60-64, Mary Ann Liebert, Inc.
  Formation of the TPC-BD Subcommittee on Big Data benchmarking

Current Status: Issues Discussed at the Workshops
  Different types of benchmarks for different aspects of a system
    Micro-benchmarks: specific lower-level system operations
      I/O operations, e.g. "A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters", Panda et al., OSU
    Functional benchmarks
      Terasort
      Basic SQL: individual SQL operations, e.g. Select, Project, Join, Order-By
    Genre-specific benchmarks
      E.g. Graph500
    Application-level benchmarks
      Measure system-level performance of hardware and software for a given dataset and workload (a given application scenario)
      E.g. TPC benchmarks: TPC-C, TPC-H, TPC-DS
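To make the "functional benchmark" category concrete, here is a minimal sketch that times each of the basic SQL operations named above in isolation. It assumes an in-memory SQLite database with two small synthetic tables; the table names, column names, row counts, and the helper names `build_db` and `run_functional_benchmark` are all illustrative and do not come from any WBDB proposal.

```python
import sqlite3
import time


def build_db(n_rows: int = 10_000) -> sqlite3.Connection:
    """Create an in-memory database with two small synthetic tables (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, cust INTEGER, amount REAL)")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region INTEGER)")
    n_customers = max(1, n_rows // 10)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        ((i, i % 10) for i in range(n_customers)),
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        ((i, i % n_customers, float(i % 97)) for i in range(n_rows)),
    )
    conn.commit()
    return conn


# One query per basic operation; each is measured on its own.
OPERATIONS = {
    "select": "SELECT * FROM orders WHERE amount > 50",
    "project": "SELECT cust FROM orders",
    "join": "SELECT o.id, c.region FROM orders o JOIN customers c ON o.cust = c.id",
    "order_by": "SELECT * FROM orders ORDER BY amount",
}


def run_functional_benchmark(conn: sqlite3.Connection, repeats: int = 3) -> dict:
    """Return the best wall-clock time in seconds for each operation."""
    timings = {}
    for name, sql in OPERATIONS.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # fetchall forces full evaluation
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


if __name__ == "__main__":
    for op, seconds in run_functional_benchmark(build_db()).items():
        print(f"{op:10s} {seconds * 1e3:8.3f} ms")
```

An application-level benchmark would instead time a whole workload over a given dataset, so the per-operation numbers here say little about end-to-end behavior.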

Benchmark Design Issues
  Audience: Who is the audience for such a benchmark?
    Marketing (customers / end users), internal use (engineering), academic use
  Application: What is the application that should be modeled?
    Abstractions of a data pipeline, e.g. Internet-scale business
  Should the benchmark be for innovation or competition?
    Successful competitive benchmarks will be used for innovation

Design Issues - 2
  Single benchmark specification: Is it possible to develop a single benchmark to capture characteristics of multiple applications?
    Single, multi-step benchmark, with plausible end-to-end scenario
  Component vs. end-to-end benchmark: Is it possible to factor out a set of benchmark components, which can be isolated and plugged into an end-to-end benchmark?
    The benchmark should consist of individual components that ultimately make up an end-to-end benchmark

Design Issues - 3
  Paper and pencil vs. implementation-based: Should the benchmark be specification-driven or implementation-driven?
    Start with an implementation and develop the specification at the same time
  Reuse: Can we reuse existing benchmarks?
    Leverage existing work and built-up knowledge base
  Benchmark data: Where do we get the data from?
    Synthetic data generation: structured, non-structured data
  Verifiability: Should there be a process for verification of results?
    YES!

Abstractions of the Big Data World from WBDB
  Enterprise Warehouse + Agglomeration of other data
    Structured enterprise data warehouse
    Extended to incorporate data from other non-fully structured data sources (e.g. weblogs, text, streams)
  Pool of data with sequence of processing
    Enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, predictive analytics
    Data from multiple structured and non-structured sources

Proposal 1: BigBench
  Ghazal et al.: Teradata, Oracle, U. of Toronto, InfoSizing
  Derived from TPC-Decision Support (TPC-DS)
    Multiple snowflake schemas with shared dimensions
    24 tables with an average of 18 columns
    99 distinct SQL-99 queries with random substitutions
    More representative skewed database content
    Sub-linear scaling of non-fact tables
    Ad-hoc, reporting, iterative and extraction queries
    ETL-like data maintenance
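The idea of "queries with random substitutions" can be illustrated with a small sketch: a fixed query template whose parameters are drawn from predefined domains using a seeded random generator, so that each run is varied yet reproducible. The template, table and column names, substitution domains, and the `instantiate` helper below are hypothetical and are not taken from the TPC-DS or BigBench specifications.

```python
import random
from string import Template

# Hypothetical template; table and column names are illustrative only.
QUERY_TEMPLATE = Template(
    "SELECT item_id, SUM(sales_price) AS revenue "
    "FROM store_sales "
    "WHERE sold_year = $year AND category = '$category' "
    "GROUP BY item_id ORDER BY revenue DESC LIMIT $limit"
)

# Hypothetical substitution domains, one per template parameter.
DOMAINS = {
    "year": [1999, 2000, 2001, 2002],
    "category": ["Books", "Music", "Sports"],
    "limit": [10, 100],
}


def instantiate(template: Template, domains: dict, seed: int) -> str:
    """Fill every template parameter with a value chosen by a seeded RNG."""
    rng = random.Random(seed)  # same seed gives the same query text
    values = {name: rng.choice(choices) for name, choices in domains.items()}
    return template.substitute(values)


if __name__ == "__main__":
    for seed in range(3):
        print(instantiate(QUERY_TEMPLATE, DOMAINS, seed))
```

Seeding the generator is one way to keep substituted queries repeatable across runs, which matters if results are to be verified.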

BigBench Data Model
  Workload = set of queries
    On structured, semi-structured, unstructured data
    Data mining, ML
  Paper published in ACM SIGMOD 2013. Full specification to appear in WBDB2012 publication

Proposal 2: Deep Analytics Pipeline
  An end-to-end data processing pipeline:
    Data from multiple sources
    Loose, flexible schema
    Data requires structuring
    ELT rather than ETL
  Application characteristics
    Processing pipelines
    Running models with data
  Pipeline stages: Acquisition/Recording -> Extraction/Cleaning/Annotation -> Integration/Aggregation/Representation -> Analysis/Modeling -> Interpretation

Example of an Application: User Modeling
  Objective: Determine user interests by mining user activities
  Large dimensionality of possible user activities
  Typical user has sparse activity vector
  Event attributes change over time

User Modeling Pipeline
  Data Acquisition
  Sessionization
  Feature and Target Generation
  Model Training
  Offline Scoring & Evaluation
  Batch Scoring & Upload to serving
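As an illustration of the sessionization step, the sketch below groups raw activity events into per-user sessions by splitting on gaps of inactivity. The event layout `(user_id, timestamp_seconds, action)`, the 30-minute threshold, and the `sessionize` helper are assumptions made for this example; the slides do not define how sessions are delimited.

```python
from collections import defaultdict

# Assumed inactivity threshold; not specified in the slides.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, action) tuples into per-user sessions.

    A new session starts whenever the time since the user's previous
    event exceeds `gap` seconds. Returns {user_id: [[event, ...], ...]}.
    """
    by_user = defaultdict(list)
    for event in events:
        by_user[event[0]].append(event)

    sessions = {}
    for user, user_events in by_user.items():
        user_events.sort(key=lambda e: e[1])  # order each user's events by time
        user_sessions = [[user_events[0]]]
        for prev, cur in zip(user_events, user_events[1:]):
            if cur[1] - prev[1] > gap:
                user_sessions.append([cur])    # gap too long: open a new session
            else:
                user_sessions[-1].append(cur)  # same session continues
        sessions[user] = user_sessions
    return sessions


if __name__ == "__main__":
    sample = [
        ("u1", 0, "view"),
        ("u1", 600, "click"),
        ("u1", 5000, "view"),
        ("u2", 100, "view"),
    ]
    for user, user_sessions in sessionize(sample).items():
        print(user, [len(s) for s in user_sessions])
```

The resulting sessions would feed the feature and target generation step, for example by counting actions per session to build the sparse activity vectors mentioned earlier.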

Next Steps
  TPC-BD subcommittee
    Join TPC if you want to influence that process
  BigData Top100 List
    An open, community effort to rank systems by performance (with price/performance) on Big Data workloads
    "HPC meets enterprise": combine ideas from TPC and Top500
      TPC has influenced design and efficiency of DBMSs over 25 years
      Borrow ranking concept from Top500
      But include price/performance and green metrics
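One way to picture a list that ranks by performance while also reporting price/performance and a green metric is sketched below. The `Result` fields, the derived metrics, and the sample systems with their numbers are entirely made up for illustration; the actual BigData Top100 metrics were still under discussion and are not defined in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    system: str
    throughput: float  # hypothetical performance score, higher is better
    price_usd: float   # hypothetical total system price
    power_kw: float    # hypothetical average power draw

    @property
    def price_performance(self) -> float:
        """Dollars per unit of throughput, lower is better."""
        return self.price_usd / self.throughput

    @property
    def throughput_per_kw(self) -> float:
        """A simple 'green' metric, higher is better."""
        return self.throughput / self.power_kw


def rank_by_performance(results):
    """Top500-style ordering: highest raw performance first."""
    return sorted(results, key=lambda r: r.throughput, reverse=True)


def rank_by_price_performance(results):
    """TPC-style ordering: cheapest per unit of performance first."""
    return sorted(results, key=lambda r: r.price_performance)


if __name__ == "__main__":
    # Made-up submissions, for illustration only.
    submissions = [
        Result("A", throughput=1000, price_usd=500_000, power_kw=20),
        Result("B", throughput=800, price_usd=200_000, power_kw=10),
        Result("C", throughput=1200, price_usd=900_000, power_kw=40),
    ]
    for r in rank_by_performance(submissions):
        print(r.system, r.throughput, r.price_performance, r.throughput_per_kw)
```

With these sample values the two orderings differ, which shows why reporting price/performance next to raw performance changes which system looks best.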

Next Steps: BigData Community Challenges
  Challenges related to the Deep Analytics Pipeline
    Definition of each step
    Ideas for machine learning and predictive analytics steps
    Ideas for metrics: performance and price/performance
  Announce competitions via Kaggle and other venues

5th WBDB
  Would like to host it in Europe (Germany?) around Summer 2014
  Looking for interested hosts, sponsors, local organizers