Understanding Big Data Workloads on Modern Processors using BigDataBench
Jianfeng Zhan, http://prof.ict.ac.cn/bigdatabench
Professor, ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences
HPBDC 2015, Ohio, USA
Institute of Computing Technology
Outline BigDataBench Overview Workload characterization Multi-tenancy version Processors evaluation
What is BigDataBench? An open source big data benchmarking project: http://prof.ict.ac.cn/bigdatabench (or search Google for "BigDataBench")
BigDataBench in Detail
Methodology: five application domains; benchmark specifications proposed for each domain
Implementation: 14 real-world data sets & 3 kinds of big data generators; 33 big data workloads with diverse implementations
Specific-purpose version: BigDataBench subset version
Five Application Domains
Internet services (taking up roughly 80% of internet services): Search Engine, Social Network, E-commerce, Multimedia, Media Streaming
Bioinformatics: the DDBJ/EMBL/GenBank database keeps growing (nucleotides in billions, entries in millions)
[Figure: breakdown of the top-20 websites by category, per-minute data growth of multimedia services (videos on YouTube, music streaming on Pandora, photos on Flickr, voice calls on Skype, video feeds from surveillance cameras), and the growth of the DDBJ/EMBL/GenBank database]
Sources: http://www.oldcolony.us/wp-content/uploads/2014/11/whatisbigdata-DKB-v2.pdf; http://www.alexa.com/topsites/global; http://www.ddbj.nig.ac.jp/breakdown_stats/dbgrowth-e.html#dbgrowth-graph
Benchmark Specification
Guidelines for BigDataBench implementation, covering the data model and workloads:
Describe the data model
Model typical application scenarios
Extract important workloads
BigDataBench Details
Methodology: five application domains; benchmark specification for each domain
Implementation: 14 real-world data sets & 3 kinds of big data generators; 33 big data workloads with diverse implementations
Specific-purpose version: BigDataBench subset version
BigDataBench Summary
14 real-world data sets: Wikipedia Entries, Amazon Movie Reviews, Google Web Graph, Facebook Social Network, E-commerce Transaction, ImageNet, English broadcasting audio, ProfSearch Resumes, DVD Input Streams, Image scene, SoGou Data, Genome sequence data, Assembly of the human genome, MNIST
Five application domains: Search Engine, Social Network, E-commerce, Multimedia, Bioinformatics
33 workloads over diverse software stacks: NoSQL, Impala, Shark, Hadoop, RDMA, MPI, DataMPI
BDGS (Big Data Generator Suite) for scalable data
Big Data Generator Tool
3 kinds of big data generators: text, graph, and table generators
Preserving the original characteristics of real data
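As a minimal sketch of the idea (BDGS itself fits richer statistical models, and every name and seed document below is illustrative, not taken from the tool): a text generator can learn the word-frequency distribution of a small set of real seed documents and then sample synthetic documents of arbitrary size that preserve that distribution.

```python
import random
from collections import Counter

def fit_word_model(seed_docs):
    """Learn a word-frequency distribution from real seed documents."""
    counts = Counter(w for doc in seed_docs for w in doc.split())
    total = sum(counts.values())
    words = list(counts)
    weights = [counts[w] / total for w in words]
    return words, weights

def generate_doc(words, weights, length, rng=random):
    """Sample a synthetic document that preserves the seed word frequencies."""
    return " ".join(rng.choices(words, weights=weights, k=length))

# illustrative seed corpus; real BDGS input would be e.g. Wikipedia entries
seed = ["big data benchmark suite", "big data workloads", "data generator suite"]
words, weights = fit_word_model(seed)
doc = generate_doc(words, weights, length=50)
```

Because the generator is a sampler rather than a copier, the synthetic output can be scaled to any volume while keeping the seed data's characteristics.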
BigDataBench Details
Methodology: five application domains; benchmark specification for each domain
Implementation: 14 real-world data sets & 3 kinds of big data generators; 33 big data workloads with diverse implementations
Specific-purpose version: BigDataBench subset version
BigDataBench Subset
Motivation: it is expensive to run all the benchmarks for system and architecture research; multiplied by different implementations, BigDataBench 3.0 provides about 77 workloads.
Approach: eliminate correlated data and identify workload characteristics from a specific perspective.
Dimension reduction (PCA) → Clustering (K-means) → Subset
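The subsetting pipeline above (standardize metrics → PCA → K-means → pick the workload closest to each centroid) can be sketched as follows. This plain-NumPy version only approximates the published procedure, and all workload names and metric values in the usage example are invented for illustration.

```python
import numpy as np

def choose_subset(features, names, n_components=2, k=3, iters=20, seed=0):
    """Pick one representative workload per cluster:
    standardize -> PCA via SVD -> Lloyd's k-means -> nearest-to-centroid."""
    X = np.asarray(features, dtype=float)
    X = (X - X.mean(0)) / (X.std(0) + 1e-12)          # standardize each metric
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # PCA
    Z = U[:, :n_components] * S[:n_components]        # projected workloads
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):                            # plain Lloyd iterations
        d = np.linalg.norm(Z[:, None] - centers[None], axis=2)
        labels = d.argmin(1)
        centers = np.array([Z[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    reps = []
    for j in range(k):   # representative = workload closest to its centroid
        idx = np.where(labels == j)[0]
        if idx.size:
            reps.append(names[idx[np.linalg.norm(Z[idx] - centers[j],
                                                 axis=1).argmin()]])
    return reps

# hypothetical per-workload metrics: (IPC, L1I MPKI, L2 MPKI, bandwidth GB/s)
names = ["H-Grep", "S-Kmeans", "H-Read", "S-Project", "S-Sort", "M-Sort"]
X = [[1.4, 8, 2, 3.0], [1.5, 7, 2, 3.2], [0.6, 22, 9, 0.6],
     [0.7, 21, 8, 0.5], [1.0, 9, 5, 1.0], [1.1, 10, 5, 1.1]]
subset = choose_subset(X, names, k=3)
```

Running the full suite then reduces to running only `subset`, one representative per behavioral cluster.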
Why BigDataBench?

Suite           Specification  Application domains  Workload types  Workloads  Scalable data sets (from real data)  Multiple implementations  Subsets  Multi-tenancy version  Simulator version
BigDataBench    Y              Five                 Four [1]        33         8                                    Y                         Y        Y                      Y
BigBench        Y              One                  Three           10         3                                    N                         N        N                      N
CloudSuite      N              N/A                  Two             8          3                                    N                         N        N                      Y
HiBench         N              N/A                  Two             10         3                                    N                         N        N                      N
CALDA           Y              N/A                  One             5          N/A                                  Y                         N        N                      N
YCSB            Y              N/A                  One             6          N/A                                  Y                         N        N                      N
LinkBench       Y              N/A                  One             10         N/A                                  Y                         N        N                      N
AMP Benchmarks  Y              N/A                  One             4          N/A                                  Y                         N        N                      N

[1] The four workload types are Offline Analytics, Cloud OLTP, Interactive Analytics and Online Service.
BigDataBench Users
http://prof.ict.ac.cn/bigdatabench/users/
Industry users: Accenture, BROADCOM, SAMSUNG, Huawei, IBM
China's first industry-standard big data benchmark suite: http://prof.ict.ac.cn/bigdatabench/industry-standard-benchmarks/
About 20 academia groups have published papers using BigDataBench
BigDataBench Publications
BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium on High Performance Computer Architecture (HPCA 2014).
Characterizing Data Analysis Workloads in Data Centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013). (Best paper award)
BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014).
BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth Workshop on Big Data Benchmarking (WBDB 2014).
Identifying Dwarfs Workloads in Big Data Analytics. arXiv preprint arXiv:1505.06872.
BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads. arXiv preprint arXiv:1504.02205.
Outline BigDataBench Overview Workload characterization Multi-tenancy version Processors evaluation
System Behaviors
Diversified system-level behaviors.
[Figure: CPU utilization and I/O wait ratio (percentage), and the average weighted disk I/O time ratio, for each workload]
System Behaviors
Diversified system-level behaviors:
High CPU utilization & less I/O time
[Figure: CPU utilization and I/O wait ratio (percentage), and the average weighted disk I/O time ratio, for each workload]
System Behaviors
Diversified system-level behaviors:
High CPU utilization & less I/O time
Relatively low CPU utilization & lots of I/O time
[Figure: CPU utilization and I/O wait ratio (percentage), and the average weighted disk I/O time ratio, for each workload]
System Behaviors
Diversified system-level behaviors:
High CPU utilization & less I/O time
Relatively low CPU utilization & lots of I/O time
Medium CPU utilization & I/O time
[Figure: CPU utilization and I/O wait ratio (percentage), and the average weighted disk I/O time ratio, for each workload]
Workloads Classification
From the perspective of system behaviors: system behaviors vary across different workloads, and the workloads are divided into 3 categories:

Type            Workloads
CPU-intensive   H-Grep, S-Kmeans, S-PageRank, H-WordCount, H-Bayes, M-Bayes, M-Kmeans and M-PageRank
I/O-intensive   H-Read, H-Difference, I-SelectQuery, S-WordCount, S-Project, S-OrderBy, M-Grep and S-Grep
Hybrid          H-TPC-DS-query3, I-OrderBy, S-TPC-DS-query10, S-TPC-DS-query8, S-Sort, M-WordCount and M-Sort
Off-Chip Bandwidth
Most CPU-intensive workloads have higher off-chip bandwidth (3 GB/s); the maximum is 6.2 GB/s.
The other workloads have lower off-chip bandwidth (0.6 GB/s).
MPI-based workloads need low memory bandwidth.
IPC of BigDataBench vs. Other Benchmarks
[Figure: IPC of each big data workload compared with the TPC-C, CloudSuite, HPCC, PARSEC, SPECfp and SPECint averages]
The average IPC of the big data workloads is larger than that of CloudSuite, SPECfp and SPECint, similar to PARSEC, and slightly lower than HPCC.
The average IPC of BigDataBench is 1.3 times that of CloudSuite.
Some workloads have high IPC (M-Kmeans, S-TPC-DS-query8).
Instruction Mix of BigDataBench vs. Other Benchmarks
Big data workloads are data-movement-dominated computing with more branch operations: 92% in terms of instruction mix (Load + Store + Branch + data movements of INT).
Pipeline Stalls
The service workloads have more RAT (Register Allocation Table) stalls.
The data analysis workloads have more RS (Reservation Station) and ROB (ReOrder Buffer) full stalls.
Notable front-end stalls (i.e., instruction fetch stalls)!
[Figure: stall breakdown for data analysis vs. service workloads]
Cache Behaviors of BigDataBench
L1I MPKI is larger than for traditional benchmarks, but lower than that of CloudSuite (12 vs. 31).
It differs among big data workloads: CPU-intensive (8), I/O-intensive (22), and hybrid workloads (9).
There are one order of magnitude differences among diverse implementations: M-WordCount is 2, while H-WordCount is 17.
Cache Behaviors
L2 cache: the I/O-intensive workloads undergo more L2 MPKI.
L3 cache: the average L3 MPKI of the big data workloads is lower than that of all the other workloads.
The underlying software stacks impact data locality: MPI workloads have better data locality and fewer cache misses.
TLB Behaviors
ITLB: I/O-intensive workloads undergo more ITLB MPKI.
DTLB: CPU-intensive workloads have more DTLB MPKI.
[Figure: DTLB and ITLB misses (MPKI) of each workload, grouped into CPU-intensive, I/O-intensive and hybrid, compared with TPC-C, CloudSuite, HPCC, PARSEC, SPECfp and SPECint]
Our Observations from BigDataBench
Unique characteristics: data-movement-dominated computing with more branch operations (92% in terms of instruction mix); notable pipeline front-end stalls.
Different behaviors among big data workloads: disparity of IPCs and memory access behaviors; CloudSuite is a subclass of big data.
Software stack impacts: the L1I cache miss rates have one order of magnitude differences among diverse implementations with different software stacks.
Correlation Analysis
Compute the correlation coefficients of CPI with other micro-architecture-level metrics.
Pearson's correlation coefficient: the absolute value (from 0 to 1) shows the dependency; the bigger the absolute value, the stronger the correlation.
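A sketch of this analysis: for each candidate metric, compute the absolute Pearson coefficient against CPI and rank the metrics by it. The metric names and measurement values below are invented for illustration, not data from the study.

```python
import numpy as np

def pearson_abs(x, y):
    """Absolute Pearson correlation coefficient between two metric series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return abs(np.corrcoef(x, y)[0, 1])

# hypothetical per-run measurements for one workload
cpi       = [1.2, 1.5, 1.1, 1.8, 1.4]
l2_mpki   = [3.0, 4.1, 2.7, 5.0, 3.6]   # tracks CPI closely
fe_stalls = [0.2, 0.1, 0.3, 0.2, 0.1]   # only weakly related to CPI

# rank metrics by how strongly they correlate with CPI
ranked = sorted({"L2_MPKI": pearson_abs(cpi, l2_mpki),
                 "FrontendStall": pearson_abs(cpi, fe_stalls)}.items(),
                key=lambda kv: -kv[1])
```

The top-ranked metrics are then the candidate explanations for CPI, which is how the slides arrive at their "top five coefficients" per workload.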
Top Five Coefficients
[Figure: the top five micro-architecture metrics most correlated with CPI for each workload: Naive Bayes, Grep, WordCount, Kmeans, FKmeans, PageRank, Sort, Hive, IBCF, HMM, SVM]
Insights
Front-end stall does not have a high correlation coefficient value for most big data analytics workloads: front-end stall is not the factor that affects the CPI performance most.
L2 cache misses and TLB misses have high correlation coefficient values: the long-latency memory accesses (to the L3 cache or memory) affect the CPI performance most and should be the optimization point with the highest priority.
Outline BigDataBench Overview Workload characterization Multi-tenancy version Processors evaluation
Cloud Data Centers
Two classes of popular workloads:
Long-running services: search engines, e-commerce sites
Short-term data analytic jobs: Hadoop MapReduce, Spark jobs
Problem
Existing benchmarks focus on specific types of workload, and their scenarios are not realistic: they do not match the typical data center scenario that mixes different percentages of tenants and workloads sharing the same computing infrastructure.
Purpose of BigDataBench-MT
Developing realistic benchmarks to reflect such practical scenarios of mixed workloads:
Both service and data analytic workloads
Dynamic scaling up and down
The tool is publicly available from http://prof.ict.ac.cn/bigdatabench/multi-tenancyversion
What can you do with it? We consider two dimensions of the benchmarking scenarios: the tenants' perspective and the workloads' perspective.
You Can Specify the Tenants
The number of tenants — Scalability benchmark: how many tenants are able to run in parallel?
The priorities of tenants — Fairness benchmark: how fair is the system, i.e., are the available resources equally available to all tenants? What if tenants have different priorities?
Time line: how do the number and priorities of tenants change over time?
You Can Specify the Workloads
Data characteristics: data type and source; input/output data volumes and distributions
Computation semantics: source code; big data software stacks
Job arrival patterns: arrival rate; arrival sequence
Two Major Challenges
Heterogeneity of real workloads: different workload types (e.g., CPU- or I/O-intensive workloads); different software stacks (e.g., Hadoop, Spark, MPI)
Workload dynamicity hidden in real-world traces: arrival patterns (request/job submitting times and sequences); job input sizes (e.g., ranging from KB to ZB)
Existing Big Data Benchmarks

Benchmarks                                               Actual workloads  Real workload traces  Mixed workloads
AMPLab benchmark, LinkBench, BigBench, YCSB, CloudSuite  Yes               No                    No
GridMix, SWIM                                            No                Yes                   No

How to generate real workloads on the basis of real workload traces is still an open question.
System Overview
Three modules:
Benchmark User Portal: a visual interface
Combiner of Workloads and Traces: a matcher of real workloads and traces
Multi-tenant Workload Generator: generates the multi-tenant workloads
Key Technique: Combination of Real and Synthetic Data Analytic Jobs
Goal: combining the arrival patterns extracted from real traces with real workloads.
Problem: workload traces only contain anonymous jobs whose workload types and/or input data are unknown.
Solution: The First Step
Deriving the workload characteristics of both real and anonymous jobs.

Table. Metrics to represent workload characteristics
Metric           Description
Execution time   Measured in seconds
CPU usage        Total CPU time per second
Memory usage     Measured in GB
CPI              Cycles per instruction
MAI              The number of memory accesses per instruction
Solution: the second step Matching both types of jobs whose workload characteristics are sufficiently similar
An Example
An example of matching Hadoop workloads:
Mining the Facebook/Google workload trace (extract workload characteristics information)
Profiling Hadoop workloads from BigDataBench (collect workload characteristics information)
Workload matching using k-means clustering

Matching result (replaying basis):
Job type  Input size (GB)  Starting time (minutes)
Bayes     2                10
Sort      1                20
K-means   0.5              25
Bayes     5                30
Sort      1                40
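The matching step can be sketched as a nearest-neighbour search in a normalized metric space; the actual tool clusters with k-means, and every metric vector, job name and number below is hypothetical, chosen only to make the example run.

```python
import numpy as np

# hypothetical metric vectors: (execution time s, CPU usage, memory GB, CPI, MAI)
profiled = {                 # BigDataBench jobs profiled on the test machines
    "Bayes":   [600, 0.8, 4.0, 1.2, 0.02],
    "Sort":    [300, 0.4, 2.0, 0.9, 0.05],
    "K-means": [900, 0.9, 6.0, 1.5, 0.01],
}
anonymous = [[580, 0.78, 3.9, 1.25, 0.021],  # jobs mined from the trace
             [310, 0.42, 2.1, 0.88, 0.048]]

names = list(profiled)
P = np.array([profiled[n] for n in names], float)
A = np.array(anonymous, float)
mu, sd = P.mean(0), P.std(0) + 1e-12   # normalize so no metric dominates
Pn, An = (P - mu) / sd, (A - mu) / sd
# match each anonymous job to the profiled job whose characteristics
# are sufficiently similar (here: closest in normalized metric space)
matches = [names[np.linalg.norm(Pn - a, axis=1).argmin()] for a in An]
```

Each anonymous trace job is then replayed as its matched real workload, with the trace supplying the input size and starting time as in the table above.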
System Demonstration
Three steps to generate a mix of search service and Hadoop MapReduce jobs.
Traces: 24-hour Sogou user query logs and the Google cluster trace.
Step 1: Specification of tested machines and workloads
Step 2: Selection of benchmarking period and scale
Step 3: Generation of mixed workloads
Workloads and Traces in BigDataBench-MT
Multi-tenancy V1.0 releases:

Workloads         Software stack                               Workload trace
Nutch Web Search  Apache Tomcat 6.0.26, Search Server (Nutch)  Sogou (http://www.sogou.com/labs/dl/q-e.html)
Hadoop            Hadoop 1.0.2                                 Facebook (https://github.com/swimprojectucb/swim/wiki)
Shark             Shark 0.8.0                                  Google data center (https://code.google.com/p/googleclusterdata/)
Outline BigDataBench Overview Workload characterization Multi-tenancy version Processors evaluation
Core Architecture
Multi brawny-core (Xeon E5645, 2.4 GHz): 6 out-of-order cores; dynamic multiple issue (superscalar); dynamic overclocking; simultaneous multithreading
Many wimpy-core architecture (Tile-Gx36, 1.2 GHz): 36 in-order cores; static multiple issue (VLIW)
Experiment Methodology
Use real hardware instead of simulation.
Measure real power consumption instead of modeling it.
Saturate CPU performance by isolating the processor behavior:
Over-provision the disk I/O subsystem by using a RAM disk
Optimize the benchmarks
Tune the software stack parameters and JVM flags for performance
Execution Time
For Hadoop-based Sort, the performance gap is about 1.08x; for the other workloads, gaps of more than 2x exist between Xeon and Tilera.
[Figure: normalized execution time on Xeon and Tilera]
From the perspective of execution time, the Xeon processor is better than the Tilera processor all the time.
Cycle Counts
There are huge cycle count gaps between Xeon and Tilera, ranging from 5.3x to 14x: Tilera needs more cycles to complete the same amount of work.
[Figure: normalized cycle counts on Xeon and Tilera]
Pipeline Efficiency
The theoretical IPC: Xeon, 4 instructions per cycle; Tilera, 1 instruction bundle per cycle.
[Figure: pipeline efficiency of Tilera and Xeon]
Out-of-order pipelines are more efficient than in-order ones.
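Pipeline efficiency here is simply the measured IPC divided by the theoretical issue width. The IPC values in the example are illustrative placeholders, not measurements from the slides.

```python
def pipeline_efficiency(measured_ipc, peak_ipc):
    """Fraction of the pipeline's theoretical issue width actually used."""
    return measured_ipc / peak_ipc

# illustrative numbers only
xeon   = pipeline_efficiency(1.6, 4.0)   # Xeon: up to 4 instructions/cycle
tilera = pipeline_efficiency(0.15, 1.0)  # Tile-Gx36: 1 bundle/cycle
```

With any numbers of this shape, the out-of-order Xeon uses a larger fraction of its (much wider) issue capacity than the in-order Tilera, which is the slide's point.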
Power Consumption
Tilera is power-optimized; Xeon consumes more power.
[Figure: normalized power consumption of Tilera and Xeon]
Energy Consumption
Hadoop-based Sort consumes less energy on Tilera than on Xeon: Hadoop Sort is an extremely I/O-intensive workload.
Tilera consumes more energy than Xeon to complete the same amount of work for most big data workloads: the longer execution time offsets the lower-power design.
[Figure: normalized energy consumption on Xeon and Tilera]
Total Cost of Ownership (TCO)
Model [*]: three-year depreciation cycle; hardware costs associated with individual components: CPU, memory, disk, board, power, cooling.
[*] K. Lim et al. Understanding and designing new server architectures for emerging warehouse-computing environments. ISCA 2008.
Cost Model
The cost data originate from diverse sources: different vendors and the corresponding official websites.
Power and cooling: an activity factor of 0.75.
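A sketch of such a cost model, in the spirit of Lim et al.: hardware component costs plus electricity for power and cooling over a three-year depreciation cycle, scaled by the activity factor of 0.75. All prices, the electricity rate and the cooling-overhead multiplier are made-up placeholders, not the slides' actual figures.

```python
def server_tco(cpu, memory, disk, board, power_w, years=3,
               activity=0.75, usd_per_kwh=0.10, cooling_overhead=1.0):
    """3-year TCO: component hardware costs plus electricity for
    power and cooling (cooling modeled as an overhead multiplier)."""
    hardware = cpu + memory + disk + board
    hours = years * 365 * 24
    energy_kwh = power_w * activity * hours / 1000.0
    electricity = energy_kwh * usd_per_kwh * (1 + cooling_overhead)
    return hardware + electricity

def perf_per_tco(throughput, tco):
    """Normalize a performance figure by total cost of ownership."""
    return throughput / tco

# illustrative comparison of two hypothetical nodes
xeon_node   = server_tco(cpu=400, memory=300, disk=100, board=200, power_w=150)
tilera_node = server_tco(cpu=350, memory=300, disk=100, board=150, power_w=50)
```

Dividing each platform's measured throughput by its TCO yields the performance-per-TCO comparison on the next slide.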
Performance per TCO
Hadoop-based Sort has higher performance per TCO on the Tilera; for the other workloads, Xeon outperforms Tilera.
[Figure: normalized performance per TCO of Tilera and Xeon, with Turbo & HT enabled on Xeon]
Key Takeaways
Try using an open source big data benchmark suite from http://prof.ict.ac.cn/bigdatabench
Big data: data-movement-dominated computing with more branch operations (92% in terms of instruction mix).
Multi-tenancy version: replaying mixed workloads according to publicly available workload traces.
Wimpy-core processors only suit a part of big data workloads.