Benchmarking and Ranking Big Data Systems

Benchmarking and Ranking Big Data Systems Xinhui Tian ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences INSTITUTE OF COMPUTING TECHNOLOGY

Outline n BigDataBench n BigDataBench Dwarfs n BigData100 n Ranking big data systems BPOE 2016 2

Why Big Data Benchmarking? Measuring big data systems and architectures quantitatively BPOE 2016 3

What is BigDataBench? n An open source big data benchmarking project http://prof.ict.ac.cn/bigdatabench Search Google using BigDataBench BPOE 2016 4

BigDataBench Users n http://prof.ict.ac.cn/bigdatabench/users/ n Industry users n Accenture, BROADCOM, SAMSUMG, Huawei, IBM n About 20 academia groups published papers using BigDataBench BPOE 2016 5

BigDataBench 3.2 Overview BDGS(Big Data Generator Suite) for scalable data Wikipedia Entries Amazon Movie Reviews Google Web Graph Facebook Social NetworkE- commerce Transaction ImageNet English broadcasting audio ProfSearch Resumes DVD Input Streams Image scene Genome sequence data Assembly of the human genome SoGou Data MNIST MovieLens Dataset 15 Real- world Data Sets Impala NoSql Search Engine Multimedia Social E-commerce Network Bioinformatics 37 Workloads Shark Hadoop RDMA MPI DataMPI Software Stacks BPOE 2016 6

What s New in BigDataBench 3.2 BigDataBench support for Flink WordCount, Grep, Naïve Bayes, PageRank, K- means Streaming JStorm, Spark Streaming Graph processing GraphX, GraphLab, Flink Gelly BPOE 2016 7

BigDataBench evolution 2015.12 New software stack: Flink, JStorm, GraphX, GraphLab New workload type: Streaming, Graph processing New dataset and workloads BigDataBench 3.2 2014.12 5 application domains: 14 data sets and 33 workloads Same specifications: diverse implementations Multi- tenancy version BigDataBench subset and simulator version BigDataBench 3.1 2014.4 Multidisciplinary effort 32 workloads: diverse implementations BigDataBench 3.0 2013.12 Typical Internet service domains An architectural perspective 19 workloads & data generation tools BigDataBench 2.0 2013.7 Search engine 6 workloads BigDataBench 1.0 11 data analytics workloads DCBench 1.0 Mixed data analytics workloads CloudRank 1.0 BPOE 2016 8

BigDataBench Publications n n n n n n BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA- 2014). Characterizing data analysis workloads in data centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013)(Best paper award) BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014) BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth workshop on big data benchmarking (WBDB 2014) Identifying Dwarfs Workloads in Big Data Analytics arxiv preprint arxiv:1505.06872 BigDataBench- MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads arxiv preprint arxiv:1504.02205 BPOE 2016 9

Outline n BigDataBench n BigDataBench Dwarfs n BigData100 n Remarking big data systems BPOE 2016 10

BigDataBench Methodology Application Domain 1 Benchmark specification 1 Real- world data sets Multi- tenancy version Application Domain Data models of different types & semantics Data operations & workloads patterns Benchmark specification Data generation tools Mix with different percentages Reduce benchmarking cost Application Domain N Benchmark specification N Workloads with diverse implementations BigDataBench subset BPOE 2016 11

BigDataBench Methodology with Dwarfs Application Domain 1 Benchmark specification 1 Real- world data sets Multi- tenancy version Application Domain Data models of different types & semantics Dwarfs Workloads Identification Benchmark specification Data generation tools Mix with different percentages Reduce benchmarking cost Application Domain N Benchmark specification N Workloads with diverse implementations BigDataBench subset BPOE 2016 12

Dwarfs Workloads n Finding the common atomic workloads in big data analytics workloads n Representing maximum patterns of big data analytics using a minimum workload set BPOE 2016 13

Inspiration Successful Compute Abstractions Relational algebra 5 primitive operations Select, Project, Product, Union, Difference Successful Benchmarks TPC- C OLTP domain Functions of abstraction Parallel computing Computational & communication patterns 13 dwarfs HPCC High performance computing Seven basically tests BPOE 2016 14

Fundamental Issues What are the challenges for big data dwarfs? How to find the dwarfs workloads in big data analytics? BPOE 2016 15

Challenges Deep learning Classification Data mining 80% d ata g rowth Dimension Clustering reduction Many echniques for plrocessing b ig d ata e xist, w hich b ring A mtachine lcomputer earning ibrary scikit learn, i mplements Machine are u nstructured Massive application make uds warfs wonder where learning vision domains greater c omplexity f or i dentifying w orkloads Data M ining so many algorithms, which is still much less than the data to start or how to achieve a wide range of coverage total number of algorithms Natural language processing Signal processi ng Regression BPOE 2016 Association rule mining 16

Methodology of Dwarfs Application Domain 1 Big Data Analytics Processing Techniques Representative Algorithms Dwarfs 1 Application Domain 2 Machine learning Data mining Deep learning Computer vision Natural language processing... Frequently appearing operations Dwarfs 2 Application Domain N Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Benchmarks BigBench, LinkBench... Different combinations Dwarfs M BPOE 2016 17

Algorithms Chosen to Investigate Computer Vision Database Software Deep Learning MPEG- 2, Scale- invariant feature transform, Image segmentation, Ray Tracing Needleman- Wunsch, Smith- Waterman, BLAST CNN, DBN Data Mining & Machine Learning C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximum- entropy markov model, Conditional random field, PageRank, HITS, Aporiori, FP- growth, Principal component analysis, Back Propagation, Adaboost, MCMC, Connected component, Random forest Natural Language Processing Latent semantic indexing, plsi, Latent dirichlet allocation, Index, Porter Stemming, Sphinx speech recognition Recommendation Demographic/Content based recommendation, Collaborative Filtering BPOE 2016 21

Frequently- appearing Operations Similarity Analysis Sampling Statistic Operation KNN, K- means, Recommendation, MCMC, LDA, Feature matching, Random Image sampling, segmentation Downsampling Probability calculation LSI, plsi, Latent dirichletallocation MPEG- 2 Association Set Operations Rules Linear Algebra Mining Sort Graph Operation Neural Jaccard Network similarity Transform Similarity Analysis Operation Locality sensitive hashing FFT, Convolution Back computation, propagation, DCT, CNN, DBN, Neural Speech recognition, network Multimedia Partial Representation sort, quick sort, Top BFS, DFS, Decision Encrpytion, Matrix/Vector k sort index, Calculation tree, Connected Fingerprint, SimHash, Set K- means, Decision tree union Component Sphinx speech recognition, MPEG- 2, SIFT, Image segmentation, Ray tracing Logic Operation MinHash Database Aporiori FP- Growth Set difference SVM, HMM, MEMM, CRF, PageRank, HITS, Logistic regression BPOE 2016 22

Dwarfs in Big Data Offline Analytics Linear Algebra Sampling Transform operation Graph operation Logic operation Set operation Statistic operation Sort BPOE 2016 23

Outline n BigDataBench n BigDataBench Dwarfs n BigData100 n Ranking big data systems BPOE 2016 24

BigData100 Ranking n An open source project for benchmarking and ranking big data systems n Using benchmarks from BigDataBench n Big data systems including Hadoop, Spark, Flink and many SQL- interface systems n http://www.bafst.com/items/top100/index.html BPOE 2016 25

Why Do Big Data Systems Ranking n First a quick glance of current big data ecosystem BPOE 2016 26

Big Data Systems Programming Interface SQL-like DataFlow APIs R Language Computation Engines DB layer Graph Layer ML Layer DataFlow Engines Resource Management Graph MPP Streaming Data Storage Column-oriented Format Distributed File Systems Key-Value Storage NOSQL Relational DB Data Sources Structured Data Unstructured Data Semi-structured Data BPOE 2016 27

Computation Engine Programming Model Intermediate Data Management Network Transfer Local Execution Task Scheduling Fault Tolerance Parallel Computation BPOE 2016 28

The Dataflow Model n A computation job is represented as a directed acyclic graph (DAG) consisting of data sources, sinks and operators n First used in the Microsoft Dryad system n Google MapReduce can be seen as a special case of DAG with only two types of operators BPOE 2016 29

DataFlow MapReduce Dryad Input Input1 Input2 Map Sort Operator1 Channel1 Operator2 Channel2 Intermediate Data Transformation Spill Operator3 Channel3 Merge Operator4 Channel4 Reduce Operator5 Output Output User function Spark Input1 Input2 Cached RDD1 RDD6 Transformation1 RDD2 Cached RDD Transformation2 RDD3 Transformation3 RDD4 Transformation4 RDD5 Action Flink Output Input1 Input2 PACT1 Channel1 PACT2 Channel2 PACT3 Channel3 PACT4 Channel4 PACT5 BPOE 2016Output 30

Other Specific Models Impala Pregel Malewicz G, et al. Pregel: a system for large- scale graph processing[c] Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146. Kornacker M, Behm A, Bittorf V, et al. Impala: A Modern, Open- Source SQL Engine for Hadoop[C]//CIDR. 2015. BPOE 2016 31

Why Do Big Data Systems Ranking? n Understanding the performance difference between general- purpose and specific systems n performance comparison n Figuring out how the design of framework impact the performance of specific types of workloads n behavior analysis BPOE 2016 32

Benchmark n Current Benchmarks n Micro workloads: WordCount, Grep n Iteration workloads: PageRank, Kmeans n Interactive Queries Select, Aggregation, Join n Others Naïve Bayes BPOE 2016 33

Benchmark Behaviors: Spark Case n Spark n Resilient distributed datasets (RDDs) Immutable, partitioned collections of objects Created through parallel transformations (map, filter, groupby, join, ) on data in stable storage Can be cached for efficient reuse HDFS File Filtered RDD Mapped RDD filter (func = _.contains(...)) map (func = _.split(...)) BPOE 2016 34

Benchmark Behaviors n Wordcount, Grep, NaiveBayes, Select n IO- intensive n Computation can be easy paralleled n No or little data transfer DataSet1 Map DataSet2 BPOE 2016 35

Benchmark Behaviors n PageRank, Kmeans n Iterative Computation n Data sets join and aggregation in each iteration n A lot of shuffle operations DataSet1 Iteration 1 Iteration 2 Iteration 3 GroupBy DataSet2 GroupBy DataSet2 GroupBy DataSet2 Join Join Join Map Map Map BPOE 2016 36

Benchmark Behaviors n Database queries n One DAG job that consists of many different transformations Including complex aggregation and join algorithms DataSet1 Map DataSet2 Map Join Partial Aggregation Full Aggregation DataSet3 BPOE 2016 37

Workload Behaviors on CPU and Memory (JVM) n Simple Batch Workloads n Wordcount, Grep, NaiveBayes, Select n Iterative Workloads n PageRank, Kmeans n Interactive Queries n Aggregation, Join BPOE 2016 38

CPU - Simple Batch Workloads WordCount NaiveBayes Grep BPOE 2016 39

CPU - Iterative Workloads PageRank KMeans BPOE 2016 40

CPU - Query Processing Aggregation Join BPOE 2016 41

Memory - JVM Memory Model http://www.journaldev.com/2856/java- jvm- memory- model- and- garbage- collection- monitoring- tuning BPOE 2016 42

JVM Memory Usage in Spark n EC: Eden Capacity n EU: Eden Usage n OC: Old Memory Capacity n OU: Old Memory Usage BPOE 2016 43

Simple Batch Workloads WordCount NaiveBayes Grep Select BPOE 2016 44

Iterative Workloads PageRank KMeans BPOE 2016 45

Query Processing Aggregation Join BPOE 2016 46

Revisit the dataflow Iteration 1 Iteration 2 Iteration 3 GroupBy DataSet2 GroupBy DataSet2 GroupBy DataSet2 Join Join Join Map Map Map DataSet1 Map DataSet2 Map Join Partial Aggregation Full Aggregation DataSet3 BPOE 2016 47

Some Primary Results of n Cluster Configuration BigData100 Computation Nodes CPU per node Memory per node Disk per node Network 16 nodes 2 x Intel Xeon E5645, 12 cores 32 GB 1TB x 2 SATA disks Gigabit Ethernet BPOE 2016 48

System Selected n Simple Batch Workloads and Iterative Workloads n Hadoop 2.7.1, Spark 1.5.1, Flink 0.9.1 n Interactive Query Processing n Spark 1.5.1, Hive 1.2.1 (on Hadoop, on Tez and on Spark), Impala 1.4.0 BPOE 2016 49

Size of data sets n WordCount, Grep n 1.600,000,000 English words from Wikipedia n About 300GB n Naïve Bayes n 5,700,000 reviews from Amazon Movie Reviews n 300GB BPOE 2016 50

Size of data sets n PageRank n Google Web Graph, 16 million vertexes, 99 million edges (unstructured graph) n KMeans n Facebook Social Network, 460 million vectors n Queries subset of TPC- DS n item: 4 GB n customer: 13 GB n order: 100 GB BPOE 2016 51

Results Simple Batch Workloads WordCount Grep 1200 1105 600 548 RunTime (s) 1000 800 600 400 680 345 Run Time (s) 500 400 300 200 410 330 200 100 0 Hadoop Spark Flink 0 Hadoop Spark Flink BPOE 2016 52

Results 600 500 490 Naive Bayes 563 Run Time (s) 400 300 351 200 100 0 Hadoop Spark Flink BPOE 2016 53

Observation n Spark has the best performance on that kind of workloads n Good data locality n One task for each partition n One reason for the bad performance in Flink n High level data locality Considering the whole file BPOE 2016 54

Results Iterative Workloads PageRank KMeans 9000 8000 7753 8000 7000 6722 Run Time (s) 7000 6000 5000 4000 3000 2000 1000 780 598 469 Run Time(s) 6000 5000 4000 3000 2000 1000 672 805 0 Hadoop Spark Flink Flink- delta 0 Hadoop Spark Flink BPOE 2016 55

Observation n Spark, Flink can be much faster than Hadoop n Advantage of the DAG model n The delta iteration from Flink can be more efficient than the bulk model used in Spark BPOE 2016 56

Results Queries 70 60 Select 58 56 65 Run Time(s) 50 40 30 20 10 38 9 0 SparkSQL HiveOnSpark HiveOnTez Hive Impala BPOE 2016 57

Results - Queries Aggregation 600 500 490 Run Time(s) 400 300 200 155 174 248 123 100 0 SparkSQL HiveOnSpark HiveOnTez Hive Impala BPOE 2016 58

Results Queries Join Run Time(s) 900 800 700 600 500 400 300 200 100 0 791 593 342 318 214 SparkSQL HiveOnSpark HiveOnTez Hive Impala BPOE 2016 59

Observations n Impala has the best performance on select and aggregation n Specific data structure designed for database query n Good disk performance due to I/O buffer managemen and short- circuit local reads n Impala gets bad performance for join operation in this situation n Big pressure on a small part of nodes BPOE 2016 60

Thanks BPOE 2016 61