BPOE Research Highlights
  Jianfeng Zhan
  ICT (Institute of Computing Technology), Chinese Academy of Sciences
  2013-10-9
  http://prof.ict.ac.cn/jfzhan
What is the BPOE workshop?
  B: Big Data Benchmarks
  PO: Performance Optimization
  E: Emerging Hardware
Motivation
  Big data covers many research fields.
  Researchers
    Specific research fields + main conferences
    Have few chances to learn about each other's work
  Gap between industry and academia
  Bringing researchers and practitioners in related areas together
BPOE communities
  Communities of architecture, systems, and data management
  Discuss the mutual influences of architectures, systems, and data management
  Bridge the gap in big data research and practice between industry and academia
BPOE: a series of workshops
  1st BPOE, in conjunction with IEEE Big Data 2013, Santa Clara, CA, USA
    Paper presentations + 2 invited talks
  2nd BPOE, in conjunction with CCF HPC China 2013, Guilin, Guangxi, China
    8 invited talks
  3rd BPOE, in conjunction with CCF Big Data Technology Conference 2013, Beijing, China
    6 invited talks
Organization
  Steering committee
    Lizy K John, University of Texas at Austin
    Zhiwei Xu, ICT, Chinese Academy of Sciences
    Cheng-zhong Xu, Wayne State University
    Xueqi Cheng, ICT, Chinese Academy of Sciences
    Jianfeng Zhan, ICT, Chinese Academy of Sciences
    Dhabaleswar K Panda, Ohio State University
  PC co-chairs
    Jianfeng Zhan, ICT, Chinese Academy of Sciences
    Weijia Xu, TACC, University of Texas at Austin
Paper submissions
  26 papers received
  Each paper has at least five reviews
  30 TPC members
  16 papers accepted
  Two invited talks
Finalized program
  Three sessions
    Performance optimization of big data systems
      3 from Intel, 1 from Hasso Plattner Institute, Germany
    Special session: Big Data Benchmarking and Performance Optimization
      2 invited talks + 1 from Academia Sinica
    Experience and evaluation with emerging hardware for big data
      1 from Japan, 3 from US
Two invited talks
  13:20-14:05 Invited talk: BigDataBench: Benchmarking big data systems
    Yingjie Shi, Chinese Academy of Sciences
  14:05-14:50 Invited talk: Facebook: Using Emerging Hardware to Build Infrastructure at Scale
    Bill Jia, PhD, Manager, Performance and Capacity Engineering, Facebook
Related work with Big Data Benchmarking
  BigDataBench
  Use cases on BigDataBench
What is BigDataBench?
  A big data benchmark suite from ICT, Chinese Academy of Sciences
    Presentation will be available soon from http://prof.ict.ac.cn/bpoe2013/program.php
  Evaluating big data (hardware) systems and architectures
  Open-source project: http://prof.ict.ac.cn/bigdatabench
Summary of BigDataBench
BigDataBench Methodology
  Investigate typical application domains
  Representative real data sets
    Data types: structured data, semi-structured data, unstructured data
    Data sources: text data, graph data, table data, extended
  Synthetic data generation tool preserving data characteristics
    Big data sets preserving 4V
  Diverse and important workloads
    Application types: offline analytics, realtime analytics, online services
    Basic & important operations and algorithms, extended
    Represent software stack, extended
    Big data workloads
  BigDataBench
Representative Datasets
  Application Domain  Data Type        Data Source  Dataset
  Search Engine       Unstructured     Text         Wikipedia Entries
  Search Engine       Unstructured     Graph        Google Web Graph
  Search Engine       Semi-structured  Table        Profsearch Person Resume
  E-commerce          Unstructured     Text         Amazon Movie Reviews
  E-commerce          Structured       Table        ABC Transaction Data
  Social Network      Unstructured     Graph        Facebook Social Graph
Chosen Workloads
  Micro benchmarks
  Basic datastore operations
  Relational queries
  Application scenarios
    Search engine
    Social network
    E-commerce system
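Chosen Workloads: Micro Benchmarks
  Application Domain  Data Type     Data Source  Operations & Algorithms  Software Stack  Benchmark ID
  Micro Benchmarks    Unstructured  Text         Sort                     Hadoop          1-1
  Micro Benchmarks    Unstructured  Text         Sort                     Spark           1-2
  Micro Benchmarks    Unstructured  Text         Sort                     MPI             1-3
  Micro Benchmarks    Unstructured  Text         Grep                     Hadoop          2-1
  Micro Benchmarks    Unstructured  Text         Grep                     Spark           2-2
  Micro Benchmarks    Unstructured  Text         Grep                     MPI             2-3
  Micro Benchmarks    Unstructured  Text         Wordcount                Hadoop          3-1
  Micro Benchmarks    Unstructured  Text         Wordcount                Spark           3-2
  Micro Benchmarks    Unstructured  Text         Wordcount                MPI             3-3
  Micro Benchmarks    Unstructured  Graph        BFS                      MPI             4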
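Chosen Workloads: Basic Datastore Operations
  Application Domain          Data Type        Data Source  Operations & Algorithms  Software Stack  Benchmark ID
  Basic Datastore Operations  Semi-structured  Table        Read                     HBase           5-1
  Basic Datastore Operations  Semi-structured  Table        Read                     Cassandra       5-2
  Basic Datastore Operations  Semi-structured  Table        Read                     MongoDB         5-3
  Basic Datastore Operations  Semi-structured  Table        Read                     MySQL           5-4
  Basic Datastore Operations  Semi-structured  Table        Write                    HBase           6-1
  Basic Datastore Operations  Semi-structured  Table        Write                    Cassandra       6-2
  Basic Datastore Operations  Semi-structured  Table        Write                    MongoDB         6-3
  Basic Datastore Operations  Semi-structured  Table        Write                    MySQL           6-4
  Basic Datastore Operations  Semi-structured  Table        Scan                     HBase           7-1
  Basic Datastore Operations  Semi-structured  Table        Scan                     Cassandra       7-2
  Basic Datastore Operations  Semi-structured  Table        Scan                     MongoDB         7-3
  Basic Datastore Operations  Semi-structured  Table        Scan                     MySQL           7-4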
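Chosen Workloads: Basic Relational Query
  Application Domain      Data Type   Data Source  Operations & Algorithms  Software Stack  Benchmark ID
  Basic Relational Query  Structured  Table        Select query             Hive            8-1
  Basic Relational Query  Structured  Table        Select query             Impala          8-2
  Basic Relational Query  Structured  Table        Aggregation query        Hive            9-1
  Basic Relational Query  Structured  Table        Aggregation query        Impala          9-2
  Basic Relational Query  Structured  Table        Join query               Hive            10-1
  Basic Relational Query  Structured  Table        Join query               Impala          10-2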
Chosen Workloads: Service
  Application Domain  Operations & Algorithms  Data Type     Data Source  Software Stack  Benchmark ID
  Search Engine       Nutch server             Structured    Table        Hadoop          11
  Search Engine       Pagerank                 Unstructured  Graph        Hadoop          12
  Search Engine       Index                    Unstructured  Text         Hadoop          13
  Social Network      Olio server              Structured    Table        MySQL           14
  Social Network      Kmeans                   Unstructured  Graph        Hadoop          15-1
  Social Network      Kmeans                   Unstructured  Graph        Spark           15-2
  Social Network      Connected components     Unstructured  Graph        Hadoop          16
  E-commerce          Rubis server             Structured    Table        MySQL           17
  E-commerce          Collaborative filtering  Unstructured  Text         Hadoop          18
  E-commerce          Naive Bayes              Unstructured  Text         Spark           19
Related work with Big Data Benchmarking
  BigDataBench
  Use cases on BigDataBench
Case Studies Based on BigDataBench
  The Implications from Benchmarking Three Different Data Center Platforms
    Q. Jing, Y. Shi, and M. Zhao. University of Science and Technology of China, and Florida International University
  AxPUE: Application Level Metrics for Power Usage Effectiveness in Data Centers
    R. Zhou, Y. Shi, C. Zhu, F. Liu. National Computer network Emergency Response Technical Team Coordination Center of China, China
  An Ensemble MIC-based Approach for Performance Diagnosis in Big Data Platform
    P. Chen, Y. Qi, X. Li, and L. Su. Xi'an Jiaotong University, China
  A Characterization of Big Data Benchmarks
    W. Xiong, Z. Yu, C. Xu. SIAT, Chinese Academy of Sciences, and Wayne State University
New Solutions of Big Data Systems
A Tradeoff?
  Energy consumption
  Performance
Questions and approach
  Questions
    What is the performance of different big data systems under different types of applications?
    What is the performance of different big data systems under different data volumes?
    What is the energy consumption of different big data systems?
  Approach
    Evaluating three respective big data systems
    Comparing two of them from performance and energy cost
    Analyzing the running features of different big data systems, and the underlying reasons
Experiment Platforms
  Xeon: mainstream processor
  Atom: low-power processor
  Tilera: many-core processor
  All the comparison experiments are based on BigDataBench.
  Basic comparison
    CPU Type           CPU Core            L1 I/D Cache  L2 Cache
    Intel Xeon E5310   4 cores @ 1.6GHz    32KB          4096KB
    Intel Atom D510    2 cores @ 1.66GHz   24KB          512KB
    Tilera TilePro36   36 cores @ 500MHz   16KB/8KB      64KB

    CPU Type     OoO Execution  FPU  Connection Mode  Buffer Sharing  TDP
    Xeon E5310   Yes            Yes  BUS              No              80W
    Atom D510    No             Yes  BUS              No              13W
    TilePro36    Yes            No   IMESH            Yes             16W
  Configurations (Hadoop cluster information)
    Comparisons: Xeon vs. Atom, Xeon vs. Tilera
    Master/Slaves: 1/7 (Xeon vs. Atom); 1/7 and 1/1 (Xeon vs. Tilera)
    Comparison: having the same core number; having the same hardware thread number
    Hadoop setting: following the Hadoop official website
Implications from the Results
  Xeon vs. Atom
    Xeon is more powerful than Atom.
    Atom is more energy-saving than Xeon when dealing with some simple applications.
    Atom doesn't show an energy advantage when dealing with complex applications.
  Xeon vs. Tilera
    Xeon is more powerful than Tilera.
    Tilera is more energy-saving than Xeon when dealing with some simple applications.
    Tilera doesn't show an energy advantage when dealing with complex applications.
    Tilera is more suitable for processing I/O-intensive applications.
Case Studies Based on BigDataBench
  AxPUE: Application Level Metrics for Power Usage Effectiveness in Data Centers
    R. Zhou, Y. Shi, C. Zhu, F. Liu
Greening Data Center
  IDC says: the Digital Universe will be 35 zettabytes by 2020.
  Nature says: distilling the meaning from big data has never been in such urgent demand.
  Data centers consumed about 1.3% of all electricity use.
  The energy bill is the largest single item in the total cost of ownership of a data center.
Power Usage Effectiveness
  "If you can not measure it, you can not improve it." (Lord Kelvin)
  PUE (Power Usage Effectiveness): a measure of how efficiently a computer data center uses its power; specifically, how much of the power is actually used by the information technology equipment.
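As a worked form of this definition, PUE is conventionally computed as the ratio of total facility power to IT equipment power. The symbols below are shorthand introduced here for illustration and do not appear on the slides.

```latex
\[
\mathrm{PUE} = \frac{P_{\text{total facility}}}{P_{\text{IT equipment}}} \geq 1
\]
```

For example, a PUE of 1.5 means the facility draws 1.5 W for every 1 W delivered to the IT equipment; values closer to 1 indicate less overhead.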
AxPUE PUE
ApPUE
  ApPUE (Application Performance Power Usage Effectiveness): a metric that measures the power usage effectiveness of IT equipment; specifically, how much of the power entering the IT equipment is used to improve application performance.
  Computation formula:
  \[ \mathrm{ApPUE} = \frac{\text{Application Performance}}{\text{IT Equipment Power}} \]
    Application Performance: the data processing performance of applications
    IT Equipment Power: the average rate of IT equipment energy consumed
AoPUE
  AoPUE (Application ): a metric that measures the power usage effectiveness of the overall data center system; specifically, how much of the total facility power is used to improve application performance.
  Computation formulas:
  \[ \mathrm{AoPUE} = \frac{\text{Application Performance}}{\text{Total Facility Power}} \]
  \[ \mathrm{AoPUE} = \frac{\mathrm{ApPUE}}{\mathrm{PUE}} \]
    Total Facility Power: the average rate of total facility energy used
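The relation between the three metrics can be checked numerically. The following is a minimal sketch, assuming application performance is expressed as a single throughput number; the `Measurement` class, its field names, and the sample values are hypothetical and not taken from the AxPUE work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One observation window; all values here are hypothetical."""
    app_performance: float   # e.g. MB of data processed per second (assumed unit)
    it_power_w: float        # average power drawn by IT equipment, in watts
    facility_power_w: float  # average power drawn by the whole facility, in watts


def pue(m: Measurement) -> float:
    # Conventional PUE: total facility power over IT equipment power.
    return m.facility_power_w / m.it_power_w


def appue(m: Measurement) -> float:
    # ApPUE = Application Performance / IT Equipment Power
    return m.app_performance / m.it_power_w


def aopue(m: Measurement) -> float:
    # AoPUE = Application Performance / Total Facility Power
    return m.app_performance / m.facility_power_w


sample = Measurement(app_performance=120.0, it_power_w=400.0, facility_power_w=600.0)

# AoPUE computed directly should agree with ApPUE / PUE.
direct = aopue(sample)
via_ratio = appue(sample) / pue(sample)
print(f"PUE={pue(sample):.2f} ApPUE={appue(sample):.3f} AoPUE={direct:.3f}")
```

Because both ApPUE and PUE share the IT equipment power term, it cancels in the ratio, which is why the two AoPUE formulas agree.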
The Roles of BigDataBench
  Conducting the experiments based on BigDataBench to demonstrate the rationality of the newly proposed AxPUE from two aspects:
    Adopting the comprehensive workloads of BigDataBench to design the application-category-sensitive experiment
    Adopting Sort of BigDataBench to design the algorithm-complexity-sensitive experiment
Case Studies Based on BigDataBench
  An Ensemble MIC-based Approach for Performance Diagnosis in Big Data Platform
    P. Chen, Y. Qi, X. Li, and L. Su
Motivation & Contributions
  Motivation
    The properties of big data bring challenges for big data management.
    Performance diagnosis is of great importance for keeping big data systems healthy.
  Contributions
    Propose a new performance anomaly detection method based on the ARIMA model for big data applications.
    Introduce a signature-based approach employing MIC invariants to correlate a specific kind of performance problem.
    Propose an ensemble approach to diagnose the real causes of performance problems in big data platforms.
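The idea behind model-based anomaly detection can be illustrated with a much simpler stand-in than ARIMA. The sketch below fits a first-order autoregressive model, x[t] = c + phi * x[t-1], on a normal trace and flags observations whose prediction error exceeds k standard deviations of the training residuals. This is not the authors' method; the function names, the threshold k = 3, and the sample traces are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    prev, nxt = series[:-1], series[1:]
    mean_prev, mean_next = mean(prev), mean(nxt)
    var = sum((p - mean_prev) ** 2 for p in prev)
    cov = sum((p - mean_prev) * (n - mean_next) for p, n in zip(prev, nxt))
    phi = cov / var if var else 0.0
    c = mean_next - phi * mean_prev
    return c, phi


def detect_anomalies(train, observed, k=3.0):
    """Return indices in `observed` whose one-step prediction error is too large."""
    c, phi = fit_ar1(train)
    residuals = [n - (c + phi * p) for p, n in zip(train[:-1], train[1:])]
    threshold = k * pstdev(residuals)
    anomalies = []
    prev = train[-1]
    for index, value in enumerate(observed):
        predicted = c + phi * prev
        if abs(value - predicted) > threshold:
            anomalies.append(index)
            prev = predicted  # do not let the outlier distort the next prediction
        else:
            prev = value
    return anomalies


# Hypothetical traces of a performance metric (for example, task latency in ms).
normal_trace = [50, 52, 51, 53, 52, 54, 53, 55, 54, 56]
new_trace = [55, 57, 90, 56, 58]
flagged = detect_anomalies(normal_trace, new_trace)
print(flagged)
```

A full ARIMA model additionally handles differencing and moving-average terms; the thresholding step on prediction residuals is the part this sketch is meant to show.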
The Roles of BigDataBench
  Conducting the experiments based on BigDataBench to evaluate the efficiency and precision of the proposed performance anomaly detection method
    Using the data generation tool of BigDataBench to generate experiment data
    Chosen workloads: Sort, Wordcount, Grep, and Naïve Bayesian
Case Studies Based on BigDataBench
  A Characterization of Big Data Benchmarks
    W. Xiong, Z. Yu, C. Xu
Main Ideas
  Characterize 16 typical workloads from BigDataBench and HiBench using micro-architecture level metrics.
  Analyze the similarity among these workloads with statistical techniques such as PCA and clustering.
  Release two typical workloads related to trajectory data processing in a real-world application domain.
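The similarity analysis can be sketched in a few lines of standard-library Python. The example below standardizes per-workload metrics, finds the first principal component by power iteration, and splits the projected scores into two groups with one-dimensional k-means. It is an illustrative simplification, not the procedure used in the paper: the metric values are made up, only six workloads and two metrics are shown, and a single component with k = 2 is an assumption.

```python
import math
from statistics import mean, pstdev

# Hypothetical metrics per workload: (instructions per cycle, L2 misses per
# 1000 instructions). The numbers are invented for illustration only.
METRICS = {
    "Sort": (0.60, 30.0),
    "Grep": (0.70, 28.0),
    "WordCount": (0.65, 29.0),
    "Kmeans": (1.40, 8.0),
    "PageRank": (1.30, 10.0),
    "NaiveBayes": (1.50, 7.0),
}


def standardize(rows):
    """Z-score every column so metrics with large units do not dominate."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Power iteration on the covariance matrix of standardized rows."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    vec = [1.0 / (i + 1) for i in range(d)]  # asymmetric start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return vec


def two_means_1d(scores, iterations=20):
    """k-means with k=2 on one-dimensional scores, seeded at min and max."""
    low, high = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iterations):
        labels = [0 if abs(s - low) <= abs(s - high) else 1 for s in scores]
        group0 = [s for s, lab in zip(scores, labels) if lab == 0]
        group1 = [s for s, lab in zip(scores, labels) if lab == 1]
        if not group1:
            break
        low, high = mean(group0), mean(group1)
    return labels


def cluster_workloads(metrics):
    """Map each workload name to a cluster label (0 or 1)."""
    names = list(metrics)
    rows = standardize([metrics[name] for name in names])
    pc1 = first_principal_component(rows)
    scores = [sum(v * w for v, w in zip(row, pc1)) for row in rows]
    return dict(zip(names, two_means_1d(scores)))


clusters = cluster_workloads(METRICS)
print(clusters)
```

Workloads that land in the same cluster behave similarly on the chosen metrics, which is the kind of redundancy a characterization study looks for when selecting a representative subset.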
Contact information
  Jianfeng Zhan
    http://prof.ict.ac.cn/jfzhan
    zhanjianfeng@ict.ac.cn
  BPOE: http://prof.ict.ac.cn/bpoe2013
  BigDataBench: http://prof.ict.ac.cn/bigdatabench