Benchmarking and Ranking Big Data Systems

Transcription

1 Benchmarking and Ranking Big Data Systems Xinhui Tian ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences INSTITUTE OF COMPUTING TECHNOLOGY

2 Outline n BigDataBench n BigDataBench Dwarfs n BigData100 n Ranking big data systems BPOE

3 Why Big Data Benchmarking? Measuring big data systems and architectures quantitatively BPOE

4 What is BigDataBench? n An open source big data benchmarking project Search Google using BigDataBench BPOE

5 BigDataBench Users n n Industry users n Accenture, BROADCOM, SAMSUMG, Huawei, IBM n About 20 academia groups published papers using BigDataBench BPOE

6 BigDataBench 3.2 Overview BDGS(Big Data Generator Suite) for scalable data Wikipedia Entries Amazon Movie Reviews Google Web Graph Facebook Social NetworkE- commerce Transaction ImageNet English broadcasting audio ProfSearch Resumes DVD Input Streams Image scene Genome sequence data Assembly of the human genome SoGou Data MNIST MovieLens Dataset 15 Real- world Data Sets Impala NoSql Search Engine Multimedia Social E-commerce Network Bioinformatics 37 Workloads Shark Hadoop RDMA MPI DataMPI Software Stacks BPOE

7 What s New in BigDataBench 3.2 BigDataBench support for Flink WordCount, Grep, Naïve Bayes, PageRank, K- means Streaming JStorm, Spark Streaming Graph processing GraphX, GraphLab, Flink Gelly BPOE

8 BigDataBench evolution New software stack: Flink, JStorm, GraphX, GraphLab New workload type: Streaming, Graph processing New dataset and workloads BigDataBench application domains: 14 data sets and 33 workloads Same specifications: diverse implementations Multi- tenancy version BigDataBench subset and simulator version BigDataBench Multidisciplinary effort 32 workloads: diverse implementations BigDataBench Typical Internet service domains An architectural perspective 19 workloads & data generation tools BigDataBench Search engine 6 workloads BigDataBench data analytics workloads DCBench 1.0 Mixed data analytics workloads CloudRank 1.0 BPOE

9 BigDataBench Publications n n n n n n BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA- 2014). Characterizing data analysis workloads in data centers IEEE International Symposium on Workload Characterization (IISWC 2013)(Best paper award) BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014) BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth workshop on big data benchmarking (WBDB 2014) Identifying Dwarfs Workloads in Big Data Analytics arxiv preprint arxiv: BigDataBench- MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads arxiv preprint arxiv: BPOE

10 Outline n BigDataBench n BigDataBench Dwarfs n BigData100 n Remarking big data systems BPOE

11 BigDataBench Methodology Application Domain 1 Benchmark specification 1 Real- world data sets Multi- tenancy version Application Domain Data models of different types & semantics Data operations & workloads patterns Benchmark specification Data generation tools Mix with different percentages Reduce benchmarking cost Application Domain N Benchmark specification N Workloads with diverse implementations BigDataBench subset BPOE

12 BigDataBench Methodology with Dwarfs Application Domain 1 Benchmark specification 1 Real- world data sets Multi- tenancy version Application Domain Data models of different types & semantics Dwarfs Workloads Identification Benchmark specification Data generation tools Mix with different percentages Reduce benchmarking cost Application Domain N Benchmark specification N Workloads with diverse implementations BigDataBench subset BPOE

13 Dwarfs Workloads n Finding the common atomic workloads in big data analytics workloads n Representing maximum patterns of big data analytics using a minimum workload set BPOE

14 Inspiration Successful Compute Abstractions Relational algebra 5 primitive operations Select, Project, Product, Union, Difference Successful Benchmarks TPC- C OLTP domain Functions of abstraction Parallel computing Computational & communication patterns 13 dwarfs HPCC High performance computing Seven basically tests BPOE

15 Fundamental Issues What are the challenges for big data dwarfs? How to find the dwarfs workloads in big data analytics? BPOE

16 Challenges Deep learning Classification Data mining 80% d ata g rowth Dimension Clustering reduction Many echniques for plrocessing b ig d ata e xist, w hich b ring A mtachine lcomputer earning ibrary scikit learn, i mplements Machine are u nstructured Massive application make uds warfs wonder where learning vision domains greater c omplexity f or i dentifying w orkloads Data M ining so many algorithms, which is still much less than the data to start or how to achieve a wide range of coverage total number of algorithms Natural language processing Signal processi ng Regression BPOE 2016 Association rule mining 16

17 Methodology of Dwarfs Application Domain 1 Big Data Analytics Processing Techniques Representative Algorithms Dwarfs 1 Application Domain 2 Machine learning Data mining Deep learning Computer vision Natural language processing... Frequently appearing operations Dwarfs 2 Application Domain N Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Benchmarks BigBench, LinkBench... Different combinations Dwarfs M BPOE

21 Algorithms Chosen to Investigate Computer Vision Database Software Deep Learning MPEG- 2, Scale- invariant feature transform, Image segmentation, Ray Tracing Needleman- Wunsch, Smith- Waterman, BLAST CNN, DBN Data Mining & Machine Learning C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximum- entropy markov model, Conditional random field, PageRank, HITS, Aporiori, FP- growth, Principal component analysis, Back Propagation, Adaboost, MCMC, Connected component, Random forest Natural Language Processing Latent semantic indexing, plsi, Latent dirichlet allocation, Index, Porter Stemming, Sphinx speech recognition Recommendation Demographic/Content based recommendation, Collaborative Filtering BPOE

22 Frequently- appearing Operations Similarity Analysis Sampling Statistic Operation KNN, K- means, Recommendation, MCMC, LDA, Feature matching, Random Image sampling, segmentation Downsampling Probability calculation LSI, plsi, Latent dirichletallocation MPEG- 2 Association Set Operations Rules Linear Algebra Mining Sort Graph Operation Neural Jaccard Network similarity Transform Similarity Analysis Operation Locality sensitive hashing FFT, Convolution Back computation, propagation, DCT, CNN, DBN, Neural Speech recognition, network Multimedia Partial Representation sort, quick sort, Top BFS, DFS, Decision Encrpytion, Matrix/Vector k sort index, Calculation tree, Connected Fingerprint, SimHash, Set K- means, Decision tree union Component Sphinx speech recognition, MPEG- 2, SIFT, Image segmentation, Ray tracing Logic Operation MinHash Database Aporiori FP- Growth Set difference SVM, HMM, MEMM, CRF, PageRank, HITS, Logistic regression BPOE

23 Dwarfs in Big Data Offline Analytics Linear Algebra Sampling Transform operation Graph operation Logic operation Set operation Statistic operation Sort BPOE

24 Outline n BigDataBench n BigDataBench Dwarfs n BigData100 n Ranking big data systems BPOE

25 BigData100 Ranking n An open source project for benchmarking and ranking big data systems n Using benchmarks from BigDataBench n Big data systems including Hadoop, Spark, Flink and many SQL- interface systems n BPOE

26 Why Do Big Data Systems Ranking n First a quick glance of current big data ecosystem BPOE

27 Big Data Systems Programming Interface SQL-like DataFlow APIs R Language Computation Engines DB layer Graph Layer ML Layer DataFlow Engines Resource Management Graph MPP Streaming Data Storage Column-oriented Format Distributed File Systems Key-Value Storage NOSQL Relational DB Data Sources Structured Data Unstructured Data Semi-structured Data BPOE

28 Computation Engine Programming Model Intermediate Data Management Network Transfer Local Execution Task Scheduling Fault Tolerance Parallel Computation BPOE

29 The Dataflow Model n A computation job is represented as a directed acyclic graph (DAG) consisting of data sources, sinks and operators n First used in the Microsoft Dryad system n Google MapReduce can be seen as a special case of DAG with only two types of operators BPOE

30 DataFlow MapReduce Dryad Input Input1 Input2 Map Sort Operator1 Channel1 Operator2 Channel2 Intermediate Data Transformation Spill Operator3 Channel3 Merge Operator4 Channel4 Reduce Operator5 Output Output User function Spark Input1 Input2 Cached RDD1 RDD6 Transformation1 RDD2 Cached RDD Transformation2 RDD3 Transformation3 RDD4 Transformation4 RDD5 Action Flink Output Input1 Input2 PACT1 Channel1 PACT2 Channel2 PACT3 Channel3 PACT4 Channel4 PACT5 BPOE 2016Output 30

31 Other Specific Models Impala Pregel Malewicz G, et al. Pregel: a system for large- scale graph processing[c] Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: Kornacker M, Behm A, Bittorf V, et al. Impala: A Modern, Open- Source SQL Engine for Hadoop[C]//CIDR BPOE

32 Why Do Big Data Systems Ranking? n Understanding the performance difference between general- purpose and specific systems n performance comparison n Figuring out how the design of framework impact the performance of specific types of workloads n behavior analysis BPOE

33 Benchmark n Current Benchmarks n Micro workloads: WordCount, Grep n Iteration workloads: PageRank, Kmeans n Interactive Queries Select, Aggregation, Join n Others Naïve Bayes BPOE

34 Benchmark Behaviors: Spark Case n Spark n Resilient distributed datasets (RDDs) Immutable, partitioned collections of objects Created through parallel transformations (map, filter, groupby, join, ) on data in stable storage Can be cached for efficient reuse HDFS File Filtered RDD Mapped RDD filter (func = _.contains(...)) map (func = _.split(...)) BPOE

35 Benchmark Behaviors n Wordcount, Grep, NaiveBayes, Select n IO- intensive n Computation can be easy paralleled n No or little data transfer DataSet1 Map DataSet2 BPOE

36 Benchmark Behaviors n PageRank, Kmeans n Iterative Computation n Data sets join and aggregation in each iteration n A lot of shuffle operations DataSet1 Iteration 1 Iteration 2 Iteration 3 GroupBy DataSet2 GroupBy DataSet2 GroupBy DataSet2 Join Join Join Map Map Map BPOE

37 Benchmark Behaviors n Database queries n One DAG job that consists of many different transformations Including complex aggregation and join algorithms DataSet1 Map DataSet2 Map Join Partial Aggregation Full Aggregation DataSet3 BPOE

38 Workload Behaviors on CPU and Memory (JVM) n Simple Batch Workloads n Wordcount, Grep, NaiveBayes, Select n Iterative Workloads n PageRank, Kmeans n Interactive Queries n Aggregation, Join BPOE

39 CPU - Simple Batch Workloads WordCount NaiveBayes Grep BPOE

40 CPU - Iterative Workloads PageRank KMeans BPOE

41 CPU - Query Processing Aggregation Join BPOE

42 Memory - JVM Memory Model jvm- memory- model- and- garbage- collection- monitoring- tuning BPOE

43 JVM Memory Usage in Spark n EC: Eden Capacity n EU: Eden Usage n OC: Old Memory Capacity n OU: Old Memory Usage BPOE

44 Simple Batch Workloads WordCount NaiveBayes Grep Select BPOE

45 Iterative Workloads PageRank KMeans BPOE

46 Query Processing Aggregation Join BPOE

47 Revisit the dataflow Iteration 1 Iteration 2 Iteration 3 GroupBy DataSet2 GroupBy DataSet2 GroupBy DataSet2 Join Join Join Map Map Map DataSet1 Map DataSet2 Map Join Partial Aggregation Full Aggregation DataSet3 BPOE

48 Some Primary Results of n Cluster Configuration BigData100 Computation Nodes CPU per node Memory per node Disk per node Network 16 nodes 2 x Intel Xeon E5645, 12 cores 32 GB 1TB x 2 SATA disks Gigabit Ethernet BPOE

49 System Selected n Simple Batch Workloads and Iterative Workloads n Hadoop 2.7.1, Spark 1.5.1, Flink n Interactive Query Processing n Spark 1.5.1, Hive (on Hadoop, on Tez and on Spark), Impala BPOE

50 Size of data sets n WordCount, Grep n 1.600,000,000 English words from Wikipedia n About 300GB n Naïve Bayes n 5,700,000 reviews from Amazon Movie Reviews n 300GB BPOE

51 Size of data sets n PageRank n Google Web Graph, 16 million vertexes, 99 million edges (unstructured graph) n KMeans n Facebook Social Network, 460 million vectors n Queries subset of TPC- DS n item: 4 GB n customer: 13 GB n order: 100 GB BPOE

52 Results Simple Batch Workloads WordCount Grep RunTime (s) Run Time (s) Hadoop Spark Flink 0 Hadoop Spark Flink BPOE

53 Results Naive Bayes 563 Run Time (s) Hadoop Spark Flink BPOE

54 Observation n Spark has the best performance on that kind of workloads n Good data locality n One task for each partition n One reason for the bad performance in Flink n High level data locality Considering the whole file BPOE

55 Results Iterative Workloads PageRank KMeans Run Time (s) Run Time(s) Hadoop Spark Flink Flink- delta 0 Hadoop Spark Flink BPOE

56 Observation n Spark, Flink can be much faster than Hadoop n Advantage of the DAG model n The delta iteration from Flink can be more efficient than the bulk model used in Spark BPOE

57 Results Queries Select Run Time(s) SparkSQL HiveOnSpark HiveOnTez Hive Impala BPOE

58 Results - Queries Aggregation Run Time(s) SparkSQL HiveOnSpark HiveOnTez Hive Impala BPOE

59 Results Queries Join Run Time(s) SparkSQL HiveOnSpark HiveOnTez Hive Impala BPOE

60 Observations n Impala has the best performance on select and aggregation n Specific data structure designed for database query n Good disk performance due to I/O buffer managemen and short- circuit local reads n Impala gets bad performance for join operation in this situation n Big pressure on a small part of nodes BPOE

61 Thanks BPOE