Benchmarking and Ranking Big Data Systems

Benchmarking and Ranking Big Data Systems
Xinhui Tian
ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences
Institute of Computing Technology

Outline
- BigDataBench
- BigDataBench Dwarfs
- BigData100
- Ranking big data systems
BPOE 2016

Why Big Data Benchmarking?
Measuring big data systems and architectures quantitatively

What is BigDataBench?
- An open source big data benchmarking project
- http://prof.ict.ac.cn/bigdatabench (or search Google for "BigDataBench")

BigDataBench Users
- http://prof.ict.ac.cn/bigdatabench/users/
- Industry users: Accenture, Broadcom, Samsung, Huawei, IBM
- About 20 academic groups have published papers using BigDataBench

BigDataBench 3.2 Overview
- BDGS (Big Data Generator Suite) for scalable data
- 15 real-world data sets: Wikipedia Entries, Amazon Movie Reviews, Google Web Graph, Facebook Social Network, E-commerce Transaction, ImageNet, English broadcasting audio, ProfSearch Resumes, DVD Input Streams, Image scene, Genome sequence data, Assembly of the human genome, SoGou Data, MNIST, MovieLens Dataset
- 37 workloads across domains: Search Engine, Social Network, E-commerce, Multimedia, Bioinformatics
- Software stacks: Hadoop, Shark, Impala, NoSQL, MPI, DataMPI, RDMA

What's New in BigDataBench 3.2
- BigDataBench support for Flink: WordCount, Grep, Naïve Bayes, PageRank, K-means
- Streaming: JStorm, Spark Streaming
- Graph processing: GraphX, GraphLab, Flink Gelly

BigDataBench evolution
- 2015.12, BigDataBench 3.2: new software stacks (Flink, JStorm, GraphX, GraphLab); new workload types (streaming, graph processing); new data sets and workloads
- 2014.12, BigDataBench 3.1: 5 application domains, 14 data sets and 33 workloads; same specifications, diverse implementations; multi-tenancy version; BigDataBench subset and simulator version
- 2014.4, BigDataBench 3.0: multidisciplinary effort; 32 workloads with diverse implementations
- 2013.12, BigDataBench 2.0: typical Internet service domains; an architectural perspective; 19 workloads and data generation tools
- 2013.7, BigDataBench 1.0: search engine; 6 workloads
- DCBench 1.0: 11 data analytics workloads
- CloudRank 1.0: mixed data analytics workloads

BigDataBench Publications
- BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium on High Performance Computer Architecture (HPCA-2014).
- Characterizing data analysis workloads in data centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013) (Best paper award).
- BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014).
- BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth Workshop on Big Data Benchmarking (WBDB 2014).
- Identifying Dwarfs Workloads in Big Data Analytics. arXiv preprint arXiv:1505.06872.
- BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads. arXiv preprint arXiv:1504.02205.

Outline
- BigDataBench
- BigDataBench Dwarfs
- BigData100
- Ranking big data systems

BigDataBench Methodology
[Figure: methodology flow. Application domains 1..N -> data models of different types & semantics, and data operations & workload patterns -> benchmark specifications 1..N -> real-world data sets and data generation tools, plus workloads with diverse implementations -> two derived versions: a multi-tenancy version (mix with different percentages) and a BigDataBench subset (reduce benchmarking cost).]

BigDataBench Methodology with Dwarfs
[Figure: the same methodology flow, with the "data operations & workload patterns" step replaced by "dwarfs workloads identification". Application domains 1..N -> data models of different types & semantics, and dwarfs workloads identification -> benchmark specifications 1..N -> real-world data sets and data generation tools, plus workloads with diverse implementations -> multi-tenancy version (mix with different percentages) and BigDataBench subset (reduce benchmarking cost).]

Dwarfs Workloads
- Finding the common atomic workloads in big data analytics workloads
- Representing maximum patterns of big data analytics using a minimum workload set

Inspiration
- Successful compute abstractions
  - Relational algebra: 5 primitive operations (Select, Project, Product, Union, Difference)
  - Parallel computing: computational & communication patterns, 13 dwarfs
- Successful benchmarks
  - TPC-C: OLTP domain, functions of abstraction
  - HPCC: high performance computing, seven basic tests

Fundamental Issues
- What are the challenges for big data dwarfs?
- How to find the dwarfs workloads in big data analytics?

Challenges
- Many techniques for processing big data exist, which brings greater complexity for identifying dwarfs workloads.
- A machine learning library, scikit-learn, implements so many algorithms, which is still much less than the total number of algorithms.
- 80% of data growth is unstructured data.
- Massive application domains make us wonder where to start or how to achieve a wide range of coverage.
[Figure labels: Machine learning, Data mining, Deep learning, Computer vision, Natural language processing, Signal processing, Classification, Regression, Clustering, Dimension reduction, Association rule mining]

Methodology of Dwarfs
[Figure: Application domains 1..N -> big data analytics -> processing techniques (machine learning, data mining, deep learning, computer vision, natural language processing, ...) -> representative algorithms, drawn from libraries (MLlib, Mahout), frameworks (Spark, Hadoop, GraphLab) and benchmarks (BigBench, LinkBench, ...) -> frequently appearing operations -> different combinations -> Dwarfs 1..M]

Algorithms Chosen to Investigate
- Computer Vision: MPEG-2, Scale-invariant feature transform, Image segmentation, Ray tracing
- Database Software: Needleman-Wunsch, Smith-Waterman, BLAST
- Deep Learning: CNN, DBN
- Data Mining & Machine Learning: C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximum-entropy Markov model, Conditional random field, PageRank, HITS, Apriori, FP-growth, Principal component analysis, Back propagation, AdaBoost, MCMC, Connected component, Random forest
- Natural Language Processing: Latent semantic indexing, pLSI, Latent Dirichlet allocation, Index, Porter stemming, Sphinx speech recognition
- Recommendation: Demographic/Content-based recommendation, Collaborative filtering

Frequently-appearing Operations
[Table: operations that recur across the investigated algorithms. Operation categories: similarity analysis, sampling (random sampling, downsampling), statistic operation (probability calculation), association rules mining, set operations (set union, set difference), linear algebra (matrix/vector calculation), sort (partial sort, quick sort, top-k sort), graph operation (BFS, DFS), transform operation (FFT, DCT, convolution), logic operation. Algorithms named in the table: KNN, K-means, recommendation, MCMC, LDA, LSI, pLSI, feature matching, image segmentation, MPEG-2, SIFT, ray tracing, locality sensitive hashing, Jaccard similarity, MinHash, SimHash, fingerprint, encryption, back propagation, CNN, DBN, neural network, Sphinx speech recognition, decision tree, connected component, Apriori, FP-Growth, SVM, HMM, MEMM, CRF, PageRank, HITS, logistic regression.]

Dwarfs in Big Data Offline Analytics
- Linear algebra
- Sampling
- Transform operation
- Graph operation
- Logic operation
- Set operation
- Statistic operation
- Sort

Outline
- BigDataBench
- BigDataBench Dwarfs
- BigData100
- Ranking big data systems

BigData100 Ranking
- An open source project for benchmarking and ranking big data systems
- Uses benchmarks from BigDataBench
- Covers big data systems including Hadoop, Spark, Flink and many SQL-interface systems
- http://www.bafst.com/items/top100/index.html

Why Rank Big Data Systems?
- First, a quick glance at the current big data ecosystem

Big Data Systems
[Figure: layered stack, top to bottom]
- Programming interface: SQL-like, DataFlow APIs, R language
- Computation engines: DB layer, graph layer, ML layer; DataFlow engines, graph, MPP, streaming
- Resource management
- Data storage: column-oriented format, distributed file systems, key-value storage, NoSQL, relational DB
- Data sources: structured data, unstructured data, semi-structured data

Computation Engine
- Programming model
- Intermediate data management
- Network transfer
- Local execution
- Task scheduling
- Fault tolerance
- Parallel computation

The Dataflow Model
- A computation job is represented as a directed acyclic graph (DAG) consisting of data sources, sinks and operators
- First used in the Microsoft Dryad system
- Google MapReduce can be seen as a special case of a DAG with only two types of operators
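The DAG view above can be sketched in a few lines of Python. This is a toy illustration only, not the API of Dryad, Spark or Flink: `run_dag`, `map_op` and `reduce_op` are hypothetical names, and the job runs in a single process. It shows a MapReduce-style word count expressed as a DAG with one source and two operators.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(operators, edges, sources):
    """Run a dataflow job described as a DAG.

    operators: operator name -> function of its upstream outputs
    edges:     operator name -> list of upstream node names
    sources:   source name   -> input data
    """
    results = dict(sources)
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(edges).static_order():
        if name in results:  # data sources are already materialised
            continue
        inputs = [results[upstream] for upstream in edges[name]]
        results[name] = operators[name](*inputs)
    return results


def map_op(lines):
    # Emit one (word, 1) pair per word.
    return [(word, 1) for line in lines for word in line.split()]


def reduce_op(pairs):
    # Sum the counts per word.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# MapReduce as a DAG with only two operator types: input -> map -> reduce.
operators = {"map": map_op, "reduce": reduce_op}
edges = {"map": ["input"], "reduce": ["map"]}
word_counts = run_dag(operators, edges, {"input": ["a b a", "b c"]})["reduce"]
```

A general dataflow engine differs from this sketch mainly in allowing arbitrary operators and fan-in/fan-out in `edges`, and in distributing each operator over partitions.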

DataFlow
[Figure: dataflow graphs of four systems. MapReduce: Input -> Map -> Sort -> Spill -> Merge -> Reduce -> Output, with intermediate data between the map and reduce sides and user functions in Map and Reduce. Dryad: Input1 and Input2 flow through Operators 1-5 connected by Channels 1-4 to Output. Spark: Input1 and Input2 become RDDs 1-6 through Transformations 1-4, some RDDs are cached, and an Action produces the Output. Flink: Input1 and Input2 flow through PACTs 1-5 connected by Channels 1-4 to Output.]

Other Specific Models
- Pregel: Malewicz G, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010: 135-146.
- Impala: Kornacker M, Behm A, Bittorf V, et al. Impala: A Modern, Open-Source SQL Engine for Hadoop[C]//CIDR. 2015.

Why Rank Big Data Systems?
- Understanding the performance difference between general-purpose and specific systems
  - Performance comparison
- Figuring out how the design of a framework affects the performance of specific types of workloads
  - Behavior analysis

Benchmark
- Current benchmarks
  - Micro workloads: WordCount, Grep
  - Iteration workloads: PageRank, KMeans
  - Interactive queries: Select, Aggregation, Join
  - Others: Naïve Bayes

Benchmark Behaviors: Spark Case
- Spark
  - Resilient distributed datasets (RDDs)
    - Immutable, partitioned collections of objects
    - Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
    - Can be cached for efficient reuse
[Figure: HDFS File -> filter(func = _.contains(...)) -> Filtered RDD -> map(func = _.split(...)) -> Mapped RDD]
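The RDD properties listed above (immutable data, partitioned, built by transformations, optionally cached) can be illustrated with a small pure-Python model. This is a sketch under stated assumptions and not the Spark API: `ToyRDD` and its methods are hypothetical, everything runs in one process, and only `map` and `filter` are modelled. It mirrors the filter-then-map pipeline in the figure.

```python
class ToyRDD:
    """Toy model of an RDD: partitioned data plus a lineage of lazy transformations."""

    def __init__(self, partitions, lineage=()):
        self._partitions = tuple(tuple(p) for p in partitions)  # data is never mutated
        self._lineage = tuple(lineage)  # pending (kind, func) steps
        self._keep = False              # set by cache()
        self._cached = None             # computed partitions, if cached

    def map(self, func):
        # A transformation returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._partitions, self._lineage + (("map", func),))

    def filter(self, func):
        return ToyRDD(self._partitions, self._lineage + (("filter", func),))

    def cache(self):
        self._keep = True
        return self

    def _compute(self, partition):
        # Replay the lineage on one partition, independently of the others.
        rows = list(partition)
        for kind, func in self._lineage:
            if kind == "map":
                rows = [func(r) for r in rows]
            else:
                rows = [r for r in rows if func(r)]
        return rows

    def collect(self):
        # An action: triggers the computation (or reuses the cached result).
        if self._cached is not None:
            computed = self._cached
        else:
            computed = [self._compute(p) for p in self._partitions]
            if self._keep:
                self._cached = computed
        return [row for part in computed for row in part]


# Two partitions of hypothetical log lines.
log = ToyRDD([["ERROR disk full", "INFO ok"], ["ERROR net down", "WARN slow"]])
errors = log.filter(lambda line: "ERROR" in line).map(lambda line: line.split()).cache()
result = errors.collect()
```

Because each partition is processed independently in `_compute`, a real engine can run one task per partition, which is the behaviour the later observation slides refer to.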

Benchmark Behaviors
- WordCount, Grep, NaiveBayes, Select
  - I/O-intensive
  - Computation can be easily parallelized
  - No or little data transfer
[Figure: DataSet1 -> Map -> DataSet2]

Benchmark Behaviors
- PageRank, KMeans
  - Iterative computation
  - Data sets are joined and aggregated in each iteration
  - A lot of shuffle operations
[Figure: DataSet1 feeds Iterations 1-3; each iteration performs GroupBy, Join with DataSet2, and Map]
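The join, group-by and map steps that repeat in every iteration can be sketched with a single-process PageRank. This is an illustrative sketch, not the BigDataBench implementation: the function name, the damping value 0.85, the iteration count and the three-page graph are assumptions, and every page is assumed to have at least one outgoing link.

```python
from collections import defaultdict


def pagerank(links, iterations=10, damping=0.85):
    """links: page -> list of pages it links to (every page has at least one link)."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Join: combine each page's link list with its current rank.
        contributions = []
        for page, neighbours in links.items():
            share = ranks[page] / len(neighbours)
            contributions.extend((dest, share) for dest in neighbours)
        # GroupBy: sum contributions per destination page.
        # In a distributed engine this step is the shuffle.
        received = defaultdict(float)
        for dest, share in contributions:
            received[dest] += share
        # Map: turn the summed contributions into the new ranks.
        ranks = {page: (1 - damping) + damping * received[page] for page in links}
    return ranks


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical tiny web graph
ranks = pagerank(links)
```

Each pass through the loop repeats the same join and group-by over the whole data set, which is why these workloads generate many shuffles and benefit from keeping `links` cached across iterations.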

Benchmark Behaviors
- Database queries
  - One DAG job that consists of many different transformations
  - Including complex aggregation and join algorithms
[Figure: DataSet1 -> Map and DataSet2 -> Map -> Join -> Partial Aggregation -> Full Aggregation -> DataSet3]
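The join, partial aggregation and full aggregation stages of such a query can be sketched as follows. The table contents, column meanings and function names are hypothetical and chosen only for illustration; the point is that partial aggregation runs inside each partition, so only small per-key summaries need to be moved before the full aggregation.

```python
from collections import Counter

# Hypothetical tables: item_id -> category, and partitioned (item_id, amount) orders.
items = {1: "book", 2: "dvd"}
order_partitions = [[(1, 10), (2, 5), (1, 7)], [(2, 3), (1, 1)]]


def join_with_items(partition):
    # Join: replace each item_id with its category (small-table lookup join).
    return [(items[item_id], amount) for item_id, amount in partition if item_id in items]


def partial_aggregate(rows):
    # Partial aggregation: sum amounts per key within one partition.
    acc = Counter()
    for key, amount in rows:
        acc[key] += amount
    return acc


def full_aggregate(partials):
    # Full aggregation: merge the per-partition summaries.
    total = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds counts key by key
    return dict(total)


partials = [partial_aggregate(join_with_items(p)) for p in order_partitions]
totals = full_aggregate(partials)
```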

Workload Behaviors on CPU and Memory (JVM)
- Simple batch workloads: WordCount, Grep, NaiveBayes, Select
- Iterative workloads: PageRank, KMeans
- Interactive queries: Aggregation, Join

CPU - Simple Batch Workloads
[Figure panels: WordCount, NaiveBayes, Grep]

CPU - Iterative Workloads
[Figure panels: PageRank, KMeans]

CPU - Query Processing
[Figure panels: Aggregation, Join]

Memory - JVM Memory Model
http://www.journaldev.com/2856/java-jvm-memory-model-and-garbage-collection-monitoring-tuning

JVM Memory Usage in Spark
- EC: Eden Capacity
- EU: Eden Usage
- OC: Old Memory Capacity
- OU: Old Memory Usage
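The four counters above match column names printed by the JDK tool `jstat -gc`; whether the slides' plots were collected with jstat is an assumption here. Under that assumption, a minimal sketch for turning one sample into eden and old-generation utilisation looks like this. The function name and the sample values are made up for illustration.

```python
def heap_utilisation(header, row):
    """Compute eden and old-generation utilisation from one sample.

    header and row are whitespace-separated strings, with the column names
    EC, EU, OC and OU present in the header (as in `jstat -gc` output).
    """
    fields = dict(zip(header.split(), (float(value) for value in row.split())))
    return {
        "eden": fields["EU"] / fields["EC"],  # Eden Usage / Eden Capacity
        "old": fields["OU"] / fields["OC"],   # Old Usage / Old Capacity
    }


# Made-up sample restricted to the four columns named on the slide (values in KB).
header = "EC EU OC OU"
row = "8192.0 4096.0 20480.0 5120.0"
usage = heap_utilisation(header, row)
```

Plotting these two ratios over time per workload gives the kind of per-workload memory comparison the following slides show.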

Memory - Simple Batch Workloads
[Figure panels: WordCount, NaiveBayes, Grep, Select]

Memory - Iterative Workloads
[Figure panels: PageRank, KMeans]

Memory - Query Processing
[Figure panels: Aggregation, Join]

Revisit the dataflow
[Figure: the two dataflows shown earlier. Iterative workloads: Iterations 1-3, each with GroupBy, Join with DataSet2, and Map. Queries: DataSet1 -> Map and DataSet2 -> Map -> Join -> Partial Aggregation -> Full Aggregation -> DataSet3.]

Some Primary Results of BigData100
- Cluster configuration
  - Computation nodes: 16 nodes
  - CPU per node: 2 x Intel Xeon E5645, 12 cores
  - Memory per node: 32 GB
  - Disk per node: 2 x 1 TB SATA disks
  - Network: Gigabit Ethernet

Systems Selected
- Simple batch workloads and iterative workloads
  - Hadoop 2.7.1, Spark 1.5.1, Flink 0.9.1
- Interactive query processing
  - Spark 1.5.1, Hive 1.2.1 (on Hadoop, on Tez and on Spark), Impala 1.4.0

Size of data sets
- WordCount, Grep
  - 1,600,000,000 English words from Wikipedia
  - About 300 GB
- Naïve Bayes
  - 5,700,000 reviews from Amazon Movie Reviews
  - 300 GB

Size of data sets
- PageRank
  - Google Web Graph, 16 million vertices, 99 million edges (unstructured graph)
- KMeans
  - Facebook Social Network, 460 million vectors
- Queries: subset of TPC-DS
  - item: 4 GB
  - customer: 13 GB
  - order: 100 GB

Results - Simple Batch Workloads
[Figure: bar charts of run time (s) on Hadoop, Spark and Flink. WordCount bar labels: 1105, 680, 345. Grep bar labels: 548, 410, 330. The assignment of labels to systems is not recoverable from the transcript.]

Results - Naive Bayes
[Figure: bar chart of run time (s) on Hadoop, Spark and Flink. Bar labels: 490, 563, 351. The assignment of labels to systems is not recoverable from the transcript.]

Observation
- Spark has the best performance on this kind of workload
  - Good data locality
  - One task for each partition
- One reason for the poor performance of Flink
  - Data locality is handled at a high level, considering the whole file

Results - Iterative Workloads
[Figure: bar charts of run time (s). PageRank on Hadoop, Spark, Flink and Flink-delta, bar labels: 7753, 780, 598, 469. KMeans on Hadoop, Spark and Flink, bar labels: 6722, 672, 805. The assignment of labels to systems is not recoverable from the transcript.]

Observation
- Spark and Flink can be much faster than Hadoop
  - Advantage of the DAG model
- The delta iteration from Flink can be more efficient than the bulk model used in Spark
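The difference between bulk and delta iteration can be sketched with connected components by minimum-label propagation. This is a single-process toy, not Flink or Spark code: the function names, the example graph and the use of "vertex visits" as a cost measure are assumptions made for illustration. The bulk version recomputes every vertex in every round, while the delta version only revisits vertices whose label changed in the previous round (the workset).

```python
def components_bulk(graph):
    """Bulk iteration: every vertex is recomputed in every round."""
    labels = {v: v for v in graph}
    visits = 0
    changed = True
    while changed:
        changed = False
        new_labels = {}
        for v in graph:
            visits += 1
            best = min([labels[v]] + [labels[n] for n in graph[v]])
            new_labels[v] = best
            changed = changed or best != labels[v]
        labels = new_labels
    return labels, visits


def components_delta(graph):
    """Delta iteration: only vertices updated in the last round are processed."""
    labels = {v: v for v in graph}  # the solution set
    workset = set(graph)            # vertices whose label changed last round
    visits = 0
    while workset:
        visits += len(workset)
        candidates = {}
        for v in workset:
            for n in graph[v]:
                # Propose v's label to n if it beats n's best label so far.
                if labels[v] < candidates.get(n, labels[n]):
                    candidates[n] = labels[v]
        labels.update(candidates)   # apply the delta to the solution set
        workset = set(candidates)   # next round touches only changed vertices
    return labels, visits


# Hypothetical undirected graph: a chain 1-2-3-4 and a separate pair 5-6.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5]}
bulk_labels, bulk_visits = components_bulk(graph)
delta_labels, delta_visits = components_delta(graph)
```

Both functions reach the same labels, but the delta version visits fewer vertices because converged parts of the graph drop out of the workset early.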

Results - Queries: Select
[Figure: bar chart of run time (s) on SparkSQL, HiveOnSpark, HiveOnTez, Hive and Impala. Bar labels: 58, 56, 65, 38, 9. The assignment of labels to systems is not recoverable from the transcript.]

Results - Queries: Aggregation
[Figure: bar chart of run time (s) on SparkSQL, HiveOnSpark, HiveOnTez, Hive and Impala. Bar labels: 490, 155, 174, 248, 123. The assignment of labels to systems is not recoverable from the transcript.]

Results - Queries: Join
[Figure: bar chart of run time (s) on SparkSQL, HiveOnSpark, HiveOnTez, Hive and Impala. Bar labels: 791, 593, 342, 318, 214. The assignment of labels to systems is not recoverable from the transcript.]

Observations
- Impala has the best performance on select and aggregation
  - Specific data structures designed for database queries
  - Good disk performance due to I/O buffer management and short-circuit local reads
- Impala performs poorly on the join operation in this situation
  - Heavy pressure on a small subset of nodes

Thanks