Big Data Benchmark Suite

Similar documents

Benchmarking and Ranking Big Data Systems

Modern Processors using BigDataBench

How To Use A Webmail On A Pc Or Macodeo.Com

BPOE Research Highlights

BigDataBench: a Big Data Benchmark Suite from Internet Services

arxiv: v1 [cs.db] 26 May 2015

Introducing EEMBC Cloud and Big Data Server Benchmarks

On Big Data Benchmarking

BigDataBench. Khushbu Agarwal

On Big Data Benchmarking

LeiWang, Jianfeng Zhan, ZhenJia, RuiHan

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

The Internet of Things and Big Data: Intro

CloudRank-D:A Benchmark Suite for Private Cloud Systems

Big Data: Image & Video Analytics

IT services for analyses of various data samples

Big Data Systems CS 5965/6965 FALL 2015

Doing Multidisciplinary Research in Data Science

Big Data Systems CS 5965/6965 FALL 2014

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Challenges for Data Driven Systems

How To Write A Bigbench Benchmark For A Retailer

Machine Learning over Big Data

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Concept and Project Objectives

Introduction to Data Mining

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

The Scientific Data Mining Process

Search and Information Retrieval

MACHINE LEARNING BASICS WITH R

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

NextGen Infrastructure for Big DATA Analytics.

Graph Database Proof of Concept Report

Data Centric Computing Revisited

Big Data Challenges in Bioinformatics

Information Management course

Big Systems, Big Data

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

The 3 questions to ask yourself about BIG DATA

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Sanjeev Kumar. contribute

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Big Data. Fast Forward. Putting data to productive use

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Advanced In-Database Analytics

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data and Marketing

AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Unified Big Data Processing with Apache Spark. Matei

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

The 4 Pillars of Technosoft s Big Data Practice

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. Aayush Agrawal

Large-Scale Data Processing

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Business Challenges and Research Directions of Management Analytics in the Big Data Era

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Topics in basic DBMS course

Steven C.H. Hoi School of Information Systems Singapore Management University

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

NetView 360 Product Description

Big Data. Lyle Ungar, University of Pennsylvania

Data Science, Predictive Analytics & Big Data Analytics Solutions. Service Presentation

Big Data: Rethinking Text Visualization

Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

What is Data Science? Data, Databases, and the Extraction of Knowledge Renée November 2014

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Hadoop. Sunday, November 25, 12

Industry Impact of Big Data in the Cloud: An IBM Perspective

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Using In-Memory Computing to Simplify Big Data Analytics

Hadoop-BAM and SeqPig

An Introduction to Data Mining

Spark and the Big Data Library

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

COMP9321 Web Application Engineering

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Machine Learning Introduction

Transcription:

BigDataBench: An Open source Big Data Benchmark Suite Jianfeng Zhan http://prof.ict.ac.cn/bigdatabench Professor, ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences WBDB 2015 Toronto, Canada INSTITUT TE OF COMPUTING TECHNOLOGY

What is BigDataBench? An open source big data benchmarking project http://prof.ict.ac.cn/bigdatabench Search Google using BigDataBench

Why BigDataBench? Specifi Application Workload Work Scalable data Multiple Multi Sub Simulat cation domains Types loads sets s (from real implemen pe e tenancy sets s or data) tations version BigDataBench Y Five Five 33 8 Y Y Y Y(3) BigBench Y One Three 10 3 N N N N CloudSuite N N/A Two 8 3 N N N Y(1) HiBench N N/A Two 10 3 N N N N CALDA Y N/A One 5 N/A Y N N N YCSB Y N/A One 6 N/A Y N N N LinkBench Y N/A One 10 1 Y N N N AMP Benchmarks N N/A One 4 N/A Y N N N

BigDataBench evolution 2014.12 5 application domains: 14 data sets and 33 workloads Same specifications: diverse implementations Multi tenancy version BigDataBench subset and simulator version BigDataBench 3.1 2014.4 Multidisciplinary effort 32 workloads: diverse implementations BigDataBench 3.0 2013.12 Typical Internet service domains An architectural perspective 19 workloads & data generation tools BigDataBench 2.0 2013.7 Search engine 6 workloads BigDataBench 1.0 11 data analytics workloads DCBench 1.0 Mixed data analytics workloads CloudRank 1.0

BigDataBench Users http://prof.ict.ac.cn/bigdatabench/users/ fi /Bi B h/ / Industry users Accenture, BROADCOM, SAMSUMG, Huawei, IBM About 20 academia groups published papers using BigDataBench

Industry Standard: BigDataBench DCADCA China s first industry standard big data benchmark suite http://prof.ict.ac.cn/bigdatabench/industrystandard benchmarks/ Telecom Research Institute of Ministry of Industry and Information Technology, ICT, CAS, Huawei, China Mobile, Sina, ZTE, Intel (China), Microsoft (China), IBM CDL, Baidu, INSPUR, ZTE, 21viane and UCloud

BigDataBench Publications BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA 2014). Characterizing data analysis workloads in data centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013)(Best paper award) BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014) BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth workshop on big data benchmarking (WBDB 2014) Identifying i Dwarfs Workloads in Big Data Analytics arxivpreprint i arxiv:1505.06872 BigDataBench MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads arxiv preprint arxiv:1504.02205

Outline BigDataBench 3.1 Overview Dwarf Workloads in Big Data Analytics Insights

What s New in BigDataBench 31 3.1 Methodology New application domains Specification for each domain Implementation Specificpurpose Version New real world data sets New big data workloads Multi tenancy version BigDataBench B subset version

What is new? Application Domains Previous version Internet service: search engine, social network, e commerce Newly added: bioinformatics, multimedia Specification Guidelines for benchmark implementation

Methodology evolution Typical Application Domains Data BigDataBench 3.0 Complexity Functions of Abstraction Scalable data sets and workloads System BigDataBench p charististics 30 3.0 Different implementations BigDataBench 3.1 Application Domain 1 Application Domain Application Domain N Data models of different types & semantics Data operations & workload patterns Benchmark specification i 1 Benchmark specification Benchmark specification N Real world data sets Data generation tools Workloads with diverse implementations Multi tenancy version Mix with different percentages Reduce benchmarking cost BigDataBench subset

Specification Search Search Engine General search and vertical search Online server and Offline analytics

Parsing: Search Engine: Parsing Extract the text content and links from the raw web pages

Indexing Search Engine: Indexing create the mapping of terms to document id lists

Search Engine: PageRank PageRank Compute the importance of the page according to the web link graph using PageRank

Search Engine: Search query Querying online web search server serving users requests

Sorting Search Engine: Sorting Sort the results according the page pg ranks and the relevance of between the query and the document

Search Engine: Recommendation Recommendation Recommend related queries to users by mining the search log

Search Engine: Statistic counting Statistic counting Countingthe the word frequency to extract the key words which represent the features of the page

Search Engine: Classification Classification Classify text content into different categories.

Search Engine: Filter & Semantic Filter Identify pages with specific topic which can be used for vertical search extraction ti Semantic extraction extract semantic information

Search Engine: Data access Data access operations Read, write, and scan the semantic information.

Specification Social network Data sets User table Relation table Article table Workloads Offline analytics

Social network: Data schema User table Relation table Tweet table

Social network: Workloads Hot review topic Select the top N tweets by the number of review Hot ttransmit ittopic Select the tweets which are transmitted more than N times. Active user Select the top N person who post the largest number of tweets. Leader of opinion Select top ones whose number of review and transmit are both large than N.

Social network: Workloads Topic classify Classify the tweets to certain category according to the topic. Sentiment classify Classify the tweets to negative or positive according to the sentiment. Friend recommendation Recommend friend to person according the relational graph. Community detection Detecting clusters or communities in large social networks. Breadth first search Sort persons according to the distance between two people.

Specification: E commerce Order table Item table

E commerce Data sets Order table Item table Workloads Offline analytics

E commerce: Workloads Sl Select query Find the items of which the sales amount is over 100 in a single order. Aggregation query Count the sales number of each goods. Join query Count the number of each goods that each buyer purchased between certain period of time. Recommendation Predict the preferences of the buyer and recommend goods to them. Sensitive classification Identify positive or negative review. Basic dt data operation Unit of operation of the data The workloads of select, aggregation, g and join are similar to the queries in A. Pavlo s sigmod09 pp paper,but are specified in the e commence environment

Specification Multimedia Voice Data Extraction Speech Recognition Video Data MPEG Decoder Frame Data Extraction Feature Extraction Image Segmentation Face Detection Three Dimensional Reconstruction Tracing

Multimedia: Workloads MPEG Decoder. Decode video streams using MPEG 2 standard. Feature extraction Fora given video frame, extract features which are invariant to scale, noise, and illumination. Speech Recognition. For a given audio file, recognize the content of the file and find whether there are sensitive words.

Ray Tracing. Multimedia: Workloads Render a 2 Dimensional video frame to a 3 Dimensional scene. Image Segmentation. Segmentthe the input video frame according to color, intensity, and texture, and extract concerned regions. Face Dt Detection. ti Detect whether face exists in the input data, if exists, then extract the face. Deep Learning. The input imagesare are classified into different categories, and then detect human face.

Specification Bioinformatics Sequence assembly. Assemble scattered and repetitive DNA fragments to originallong long sequence. Sequence alignment. Align assembled DNA sequence to known sequences in the database, and detect disease. Gene Sequencing Genome Sequence Data Sequence Assembly Sequence Mapping Sequence Alignment Detection Result

What s New in BigDataBench 31 3.1 Methodology New application domains Specification for each domain Implementation Specificpurpose Version New real world data sets New big data workloads Multi tenancy version BigDataBench B subset version

BigDataBench 3.1 Overview BDGS(Big Data Generator Suite) for scalable data Wikipedia Entries Amazon Movie Reviews Google Web Graph Facebook Social NetworkE commerce Transaction ImageNet English broadcasting audio ProfSearch Resumes DVD Input Streams Image scene SoGou Data Genome sequence data Assembly of the human genome MNIST 14 Real world Data Sets Search Engine Multimedia Social E-commerce Network Bioinformatics 33 Workloads NoSql Impala Shark Hadoop RDMA MPI DataMPI Software Stacks

What s New in BigDataBench 31 3.1 Methodology New application domains Specification for each domain Implementation Specificpurpose Version New real world data sets New big data workloads Multi tenancy version BigDataBench B subset version

BigDataBench Multi tenancy Version Scenarios of multiple tenants running heterogeneous applications in cloud datacenters Latency critical online services Latency insensitive offline batch applications Benchmarking scenarios Mining real world Workload traces (Google and Facebook) Profiling Realworld Workload traces Workload matching using Machine learning techniques Parametric workload generation tool Mixed workloads in public clouds Data analytical workloads in private clouds

BigDataBench Subset Motivation It is expensive to run all the benchmarks for system and architecture researches multiplied by different implementations BigDataBench 3.0 provides about 77 workloads

Methodology of Subsetting Identify a comprehensive set of workload characteristics from a specific perspective Eliminatethe the correlation data in those metrics Map the high dimension metrics to a low dimension Use the clustering method to classify Choose representative tti workloads from each category

Outline BigDataBench 3.1 Overview Dwarf Workloads in Big Data Analytics Insights

Why Dwarf Workloads in Big Data One attempt Ui Using popular workloads Subjective Another attempt Analytics How to define a representative big data benchmark? Fail to cover wide coverage and great complexity of big data Specific domains of systems Little analysis about representativen ess of workloads How can we construct a benchmark suite using a minimum set of units of computation to represent diversity of big data analytics? Dwarf workloads!

Significance A minimum set of necessary functionality A direction for evaluation and performance optimization A highly abstraction of patterns Dwarf Workloads Strong expression power http://www.krellinst.org/doecsgf/conf/2014/pres/jhill.pdf pdf http://cacs.usc.edu/education/cs596/davidpatterson.pdf

Inspiration Successful Compute Abstractions Relational algebra 5 primitive operations Select, Project, Product, Unio n, Difference Successful Benchmarks TPC C C OLTP domain Functions of abstraction Parallel lcomputing HPCC Computational & High performance computing communication patterns Seven basically tests 13 dwarfs

Fundamental Issues What is the dwarf workloads in big data analytics? How to find?

Challenges #1 Massive Application Domains Big Data Massive application domains make us wonder where to start or how to achieve a wide range of coverage

Challenges #2 Multiple Research Fields and Techniques Deep learning Data mining Classification Machine learning Computer vision Clustering Data Mining Dimension reduction Natural language processing Signal processi ng Regression Association rule mining Many techniques for processing big data exist, which bring greater complexity for identifying dwarfs workloads

Challenges #3 Large Numbers of Algorithms and Variants A machine learning library scikitlearn, implements so many algorithms, which is still much less than the total number of algorithms http://scikit learn.org/stable/tutorial/machine_learning_map/

Challenge #4 Unstructured Data with Complicated Operations 80% data growth are unstructured data Operations on big data are complicated Pipeline? Parallel? https://www.capgemini.com/blog/capping it off/2014/07/are you effectively using big data

Methodology of Dwarfs Application Big Data Analytics Representative Domain 1 Algorithms Processing Techniques Application Domain 2 Application Domain N Machine learning Data mining Deep learning Computer vision Natural language processing... Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Frequently appearing operations Different combinations Dwarfs 1 Dwarfs 2 Benchmarks BigBench, LinkBench... Dwarfs M

Internet Services Search Engine Electronic Commerce Others 15% 5% Social Network Media Streaming 40% Taking up 80% of internet services according to page views and daily visitors 15% 25% Top 20 websites http://www.alexa.com/topsites/global;0

The Explosive Growth of Multimedia Data new VIDEOS on YouTube every minute hours MUSIC streaming on PANDORA every minute new PHOTOS on FLICKR every minute VIDEO feeds from minutes VOICE calls on are surveillance cameras Skype every minute uments, http://www.oldcolony.us/wp content/uploads/2014/11/whatisbigdata DKB v2.pdf l / t t/ l d /2014/11/ h ti d t 2df data growth IMAGES, VIDEOS, doc

The Explosive Growth of Bioinformatics Data DDBJ/EMBL/GenBank database Growth Nucleotides Entries Nucleot tides (billion) 200 180 160 140 120 100 80 60 40 180 160 140 120 100 80 60 40 Entr ries (million) 20 20 0 0 http://www.ddbj.nig.ac.jp/breakdown_stats/dbgrowth e.html#dbgrowth graph

The Explosion Growth of Astronomy The last decade Data hundreds of terabytes of astronomical data for hundreds of millions of sources The next decade Large Synoptic Survey Telescope (LSST) One 6 Gigabyte 30 Terabytes image every every night 20 seconds for 10years 100 Petabyte final image data archive anticipated http://www.samsi.info/sites/default/files/borne_september2012.pdf

Chosen Application Domain Nucleotides (b billion) Search Engine Social Network Electronic Commerce Media Streaming 200 150 100 Others 15% 5% 40% Internet 15% tservice Search engine, Social network, E commerce 25% 50 Top 20 websites DDBJ/EMBL/GenBank database Growth Nucleotides 0 Bioinformatics 150 100 50 0 lion) Entries (mill new VIDEOS on YouTube every minute VIDEO feeds from surveillance cameras hours MUSIC streaming on PANDORA every minute minutes VOICE calls on Skype every minute new PHOTOS on FLICKR every minute Multimedia data growth are IMAGES, VIDEOS, d ocuments, Large Synoptic Survey Telescope (LSST) One 6 Gigabyte image every 20 seconds 30 Terabytes every night for 10 years Astronomy 100 Petabyte final image data archive anticipated

Methodology of Dwarfs Application Big Data Analytics Representative Domain 1 Algorithms Processing Techniques Application Domain 2 Application Domain N Machine learning Data mining Deep learning Computer vision Natural language processing... Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Frequently appearing operations Different combinations Dwarfs 1 Dwarfs 2 Benchmarks BigBench, LinkBench... Dwarfs M

Big Data Analytics Structured Semi Structured Unstructured Text Graph Table Multimedia Data Model Semantics Processing Techniques Open Source Projects Data Mining Machine Learning Libraries Frameworks Benchmarks

Libraries & Frameworks & Benchmarks Libraries Opencv, Mlib, lb Weka, AstroML Frameworks Spark, Hadoop, Graphlab Benchmarks BigDataBench, Linkbench

Methodology of Dwarfs Application Domain 1 Application Domain 2 Application Domain N Big Data Analytics Processing Techniques Machine learning Data mining Deep learning Computer vision Representative Algorithms Frequently appearing operations Dwarfs 1 Natural language processing... Dwarfs 2 Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Different combinations Benchmarks BigBench, LinkBench... Dwarfs M

Algorithms Chosen to Investigate Computer Vision Bioinformatics Deep Learning MPEG 2, Scale invariant feature transform, Image segmentation, Ray Tracing Needleman Wunsch, Smith Waterman, BLAST CNN, DBN Data Mining & Machine Learning Natural Language Processing Recommendation C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximumentropy markov model, Conditional random field, PageRank, HITS, Aporiori, FPgrowth, Principal component analysis, Back Propagation, Adaboost, MCMC, Connect edcomponent, Random forest Latent semantic indexing, plsi, Latent dirichlet allocation, Index, Porter Stemming, Sphinx speech recognition Demographic/Content based recommendation, Collaborative Filtering

Frequently appearing Operations Similarity Analysis KNN, K means, Recommendation, Feature matching, Image segmentation Neural Network Back propagation, CNN, DBN, Neural network Linear Algebra Multimedia Representation Sphinx speech recognition, MPEG 2, SIFT, Image segmentation, Ray tracing Matrix/Vector Calculation SVM, HMM, MEMM, CRF, PageRank, HITS, Logistic regression

Frequently appearing Operations(cont ) Similarity Analysis Jaccard similarity Locality sensitive hashing Set Operations Association Rules Mining Aporiori FP Growth Database St Set union Set difference

Frequently appearing Operations(cont ) Sampling MCMC, LDA, Rando m sampling, Downsam pling Transform Operation FFT, Convolution computation, DCT, S peech recognition, MPEG 2 Graph Operation BFS, DFS, Decision tree, Connected Component Logic Operation Encrpytion, index, Fin gerprint, SimHash, Mi nhash

Frequently appearing Operations(cont ) Two more primitive operations Statistic Operation Probability calculation LSI, plsi, Latent dirichlet allocation Sort Partial sort, quick sort, Top k sort K means, Decision tree

Methodology of Dwarfs Application Big Data Analytics Representative Domain 1 Algorithms Processing Techniques Application Domain 2 Application Domain N Machine learning Data mining Deep learning Computer vision Natural language processing... Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Frequently appearing operations Different combinations Dwarfs 1 Dwarfs 2 Benchmarks BigBench, LinkBench... Dwarfs M

Finalized 8 Dwarfs workloads Linear Algebra Sort Sampling Transform operation Graph operation Logic operation Set operation Statistic operation

Properties Composability Algorithm can be composed of one or more dwarfs Dwarfs (one or more) Flow control Algorit hm Irreversibility combination is sensitive to the order of dwarfs Dwarfs A Dwarfs B Dwarfs B Dwarfs A Uniqueness Represent different patterns

Outline BigDataBench 3.1 Overview Dwarf Workloads in Big Data Analytics Insights

System Behaviors Diversified system level behaviors: 100% 80% 60% 40% 20% CPU utilization I/O wait ratio Percentage High CPU utilization & less I/O time Low CPU utilization relatively and lots of I/O time Mdi Medium CPU utilization 0% H-Grep(7) S-Kmeans(1) S-PageRank(1) H-Wor dcount(1) H-Bayes(1) M-Bayes M-Kmeans M-PageRank H- H-Diff I-Select S-Wor S- -Read(10) ference(9) tquery(9) dcount(8) -Project(4) S-OrderBy(3) H S-Grep(1) M-Grep -TPC-DS- I-OrderBy(7) S S -TPC-DS- -TPC-DS- S-Sort(1) M-WordCount M-Sort AVG_S_BigData 100 10 time and I/O 1 0.1 0.01 H-Grep(7) S-Kmeans(1) S-PageRank(1) H-WordCount(1) H-Bayes(1) M-Bayes M-Kmeans M-PageRank H-Read(10) H-Difference(9) I-SelectQuery(9) S-WordCount(8) S-Project(4) S-OrderBy(3) S-Grep(1) M-Grep H-TPC-DS- I-OrderBy(7) S-TPC-DS- S-TPC-DS- S-Sort(1) M-WordCount M-Sort AVG_S_BigData Weighted I/O Weighted disk I/O time ratio

Workloads Classification Finding from system behaviors System behaviors vary across different workloads Workloads can be divided into 3 categories:

IPC of BigDataBench vs. other benchmarks 2 1.5 1 IPC 0.5 0 H-Grep(7) S-Kmeans(1) S-PageRank(1) H-WordCount(1) H-Bayes(1) M-Bayes M-Kmeans M-PageRank H-Read(10) H-Difference(9) I-SelectQuery(9) S-WordCount(8) S-Project(4) S-OrderBy(3) S-Grep(1) M-Grep H-TPC-DS-query3(9) I-OrderBy(7) S-TPC-DS- S-TPC-DS-query8(1) S-Sort(1) M-WordCount M-Sort AVG_S_BigData TPC-C AVG_CloudSuite Avg_HPCC Avg_PARSEC AVG_SPECfp AVG_SPECint The average IPC of the big data workloads larger than CloudSuite, SPECFP and SPECINT, same like PARSEC and slightly smaller than HPCC The avrerage IPC of BigDataBench is 1.3 times of that of CloudSuite Some workloads have high IPC (M_Kmeans, S TPC DS Query8)

Instructions Mix of BigDataBench vs. other benchmarks Big data workloads are data movement dominated dcomputing with ihmore branch operations 92% percentage in terms of instruction mix (Load + Store + Branch + data movements of INT)

Cache Behaviors of BigDataBench vs. other benchmarks CPU intensive I/O intensive hybrid L1I MPKI Larger than traditional benchmarks, but lower than that of CloudSuite (12 Vs. 31) Different among big data workloads CPU intensive(8), I/O intensive(22), and hybrid workloads(9) One order of magnitude differences among diverse implementations M_WordCount is 2, while H_WordCount is 17

Cache Behaviors L2 Cache: The IO intensive workloads undergo more L2 MPKI L3 Cache: The average L3 MPKI of the big data workloads is smaller than all of the other workloads The underlying software stacks impact data locality MPI workloads have better data locality and less cache misses

Locality Instructions Cache miss ratio versus Cache size Data Cache miss ratio versus Cache size. Hadoop workloads have larger instructions footprint

Our observation from BigDataBench Unique characteristic data movement dominated computing with more branch operations 92% percentage in terms of instruction mix Locality Hadoop workloads have larger instructions footprint Different behaviors among Big Data workloads Disparity of ILP and memory access behaviors CloudSuite is a subclass of Big Data Software stacks impacts The L1I cache miss rates have one order of magnitude differences among diverse implementations with different software stacks.